Speeding up JSON parsing in Go

We have transaction log files in which each transaction is a single line in JSON format. We often need to take selected parts of the data, perform a single time conversion, and feed results into another system in a specific format. I wrote a Python script that does this as we need, but I hoped that Go would be faster, and would give me a chance to start learning Go. So, I wrote the following:
package main

import "encoding/json"
import "fmt"
import "time"
import "bufio"
import "os"

func main() {
    sep := ","
    reader := bufio.NewReader(os.Stdin)
    for {
        data, _ := reader.ReadString('\n')
        byt := []byte(data)
        var dat map[string]interface{}
        if err := json.Unmarshal(byt, &dat); err != nil {
            break
        }
        status := dat["status"].(string)
        a_status := dat["a_status"].(string)
        method := dat["method"].(string)
        path := dat["path"].(string)
        element_uid := dat["element_uid"].(string)
        time_local := dat["time_local"].(string)
        etime, _ := time.Parse("[02/Jan/2006:15:04:05 -0700]", time_local)
        fmt.Print(status, sep, a_status, sep, method, sep, path, sep, element_uid, sep, etime.Unix(), "\n")
    }
}
That compiles without complaint, but I'm surprised at the lack of performance improvement. To test, I placed 2,000,000 lines of logs into a tmpfs (to ensure that disk I/O would not be a limitation) and compared the two versions of the script. My results:
$ time cat /mnt/ramdisk/logfile | ./stdin_conv > /dev/null
real 0m51.995s
$ time cat /mnt/ramdisk/logfile | ./stdin_conv.py > /dev/null
real 0m52.471s
$ time cat /mnt/ramdisk/logfile > /dev/null
real 0m0.149s
How can this be made faster? I have made some rudimentary efforts. The ffjson project, for example, proposes to create static functions that make reflection unnecessary; however, I have failed so far to get it to work, getting the error:
Error: Go Run Failed for: /tmp/ffjson-inception810284909.go
STDOUT:
STDERR:
/tmp/ffjson-inception810284909.go:9:2: import "json_parse" is a program, not an importable package
Besides, wouldn't what I have above be considered statically typed? Possibly not-- I am positively dripping behind the ears where Go is concerned. I have tried selectively disabling different attributes in the Go code to see if one is especially problematic. None have had an appreciable effect on performance. Any suggestions on improving performance, or is this simply a case where compiled languages have no substantial benefit over others?

Try using a struct type to remove all of this unnecessary assignment and type assertion:
type RenameMe struct {
    Status     string    `json:"status"`
    Astatus    string    `json:"a_status"`
    Method     string    `json:"method"`
    Path       string    `json:"path"`
    ElementUid string    `json:"element_uid"`
    TimeLocal  string    `json:"time_local"`
    Etime      time.Time `json:"-"` // deal with this after the fact
}

data := &RenameMe{}
if err := json.Unmarshal(byt, data); err != nil {
    break
}
data.Etime, _ = time.Parse("[02/Jan/2006:15:04:05 -0700]", data.TimeLocal)
I'm not going to test this to ensure it outperforms your code but I bet it does by a large margin. Give it a try and let me know please.
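For reference, here is an untested end-to-end sketch of the same idea: the struct type above (names are illustrative), a json.Decoder streaming straight from stdin, and buffered output. Note that the unbuffered fmt.Print in the original issues a write syscall per line, which is itself a significant cost at 2,000,000 lines:

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
    "time"
)

type Record struct {
    Status     string `json:"status"`
    AStatus    string `json:"a_status"`
    Method     string `json:"method"`
    Path       string `json:"path"`
    ElementUid string `json:"element_uid"`
    TimeLocal  string `json:"time_local"`
}

func main() {
    sep := ","
    out := bufio.NewWriter(os.Stdout)
    defer out.Flush()
    // Decode reads one JSON value at a time, so newline-delimited
    // records can be streamed without splitting lines by hand.
    dec := json.NewDecoder(bufio.NewReader(os.Stdin))
    for {
        var rec Record
        if err := dec.Decode(&rec); err != nil {
            break
        }
        etime, _ := time.Parse("[02/Jan/2006:15:04:05 -0700]", rec.TimeLocal)
        fmt.Fprint(out, rec.Status, sep, rec.AStatus, sep, rec.Method, sep, rec.Path, sep, rec.ElementUid, sep, etime.Unix(), "\n")
    }
}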

http://jsoniter.com/ declares itself to be the fastest JSON parser; Go and Java implementations are provided. Two styles of API can be used, and defining a JSON object (struct) ahead of time is optional.
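A minimal, untested sketch of jsoniter's drop-in API (package github.com/json-iterator/go); ConfigCompatibleWithStandardLibrary mimics encoding/json behaviour:

package main

import (
    "fmt"

    jsoniter "github.com/json-iterator/go"
)

// Shadowing the name json makes the rest of the file read like
// ordinary encoding/json code.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

func main() {
    var dat map[string]interface{}
    if err := json.Unmarshal([]byte(`{"status":"200"}`), &dat); err == nil {
        fmt.Println(dat["status"])
    }
}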

Check https://github.com/pquerna/ffjson
I saw 3x improvements over the standard marshal/unmarshal employed by the standard lib. It does so by rewriting the source and removing the need for reflection.
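The inception error above suggests ffjson was pointed at a main package; the generator imports the target package, which therefore must be importable (not a program). A hedged sketch of the workflow, with illustrative file, package, and struct names:

// record.go - a plain, importable package holding the struct.
package record

type Record struct {
    Status string `json:"status"`
    Path   string `json:"path"`
}

// Then, from the shell (illustrative):
//   $ go get -u github.com/pquerna/ffjson
//   $ ffjson record.go
// This generates record_ffjson.go containing reflection-free
// MarshalJSON/UnmarshalJSON methods for Record.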

Related

Quicker way to deepcopy objects in golang, JSON vs gob

I am using Go 1.9, and I want to deep-copy the value of one object into another object. I tried to do it with encoding/gob and encoding/json, but gob encoding takes more time than json encoding. I have seen some other questions like this one, and they suggest that gob encoding should be quicker, but I see the exact opposite behaviour. Can someone tell me if I am doing something wrong? Or is there a better and quicker way to deep-copy than these two? My object's struct is complex and nested.
The test code:
package main

import (
    "bytes"
    "encoding/gob"
    "encoding/json"
    "log"
    "strconv"
    "time"
)

// Test ...
type Test struct {
    Prop1 int
    Prop2 string
}

// Clone deep-copies a to b
func Clone(a, b interface{}) {
    buff := new(bytes.Buffer)
    enc := gob.NewEncoder(buff)
    dec := gob.NewDecoder(buff)
    enc.Encode(a)
    dec.Decode(b)
}

// DeepCopy deep-copies a to b using json marshaling
func DeepCopy(a, b interface{}) {
    byt, _ := json.Marshal(a)
    json.Unmarshal(byt, b)
}

func main() {
    i := 0
    tClone := time.Duration(0)
    tCopy := time.Duration(0)
    end := 3000
    for {
        if i == end {
            break
        }
        r := Test{Prop1: i, Prop2: strconv.Itoa(i)}
        var rNew Test
        t0 := time.Now()
        Clone(r, &rNew)
        t2 := time.Now().Sub(t0)
        tClone += t2

        r2 := Test{Prop1: i, Prop2: strconv.Itoa(i)}
        var rNew2 Test
        t0 = time.Now()
        DeepCopy(&r2, &rNew2)
        t2 = time.Now().Sub(t0)
        tCopy += t2

        i++
    }
    log.Printf("Total items %+v, Clone avg. %+v, DeepCopy avg. %+v, Total Difference %+v\n", i, tClone/3000, tCopy/3000, (tClone - tCopy))
}
I get following output:
Total items 3000, Clone avg. 30.883µs, DeepCopy avg. 6.747µs, Total Difference 72.409084ms
JSON vs gob difference
The encoding/gob package needs to transmit type definitions:
The implementation compiles a custom codec for each data type in the stream and is most efficient when a single Encoder is used to transmit a stream of values, amortizing the cost of compilation.
When you "first" serialize a value of a type, the definition of the type also has to be included / transmitted, so the decoder can properly interpret and decode the stream:
A stream of gobs is self-describing. Each data item in the stream is preceded by a specification of its type, expressed in terms of a small set of predefined types.
This is explained in great detail here: Efficient Go serialization of struct to disk
In your case a new gob encoder and decoder is created for every single clone, so the type description is transmitted every time; this is the bottleneck, the part that makes gob slow here. When encoding to and decoding from JSON, no type description is included in the representation at all.
To prove it, make this simple change:
type Test struct {
    Prop1 [1000]int
    Prop2 [1000]string
}
What we did here is change the field types to arrays, "multiplying" the values a thousand times while the type information effectively remains the same (all elements in the arrays have the same type). We create values of them like this:
r := Test{Prop1: [1000]int{}, Prop2: [1000]string{}}
Now running your test program, the output on my machine:
Original:
2017/10/17 14:55:53 Total items 3000, Clone avg. 33.63µs, DeepCopy avg. 2.326µs, Total Difference 93.910918ms
Modified version:
2017/10/17 14:56:38 Total items 3000, Clone avg. 119.899µs, DeepCopy avg. 462.608µs, Total Difference -1.02812648s
As you can see, in the original version JSON is faster, but in the modified version gob became faster, as the cost of transmitting the type info was amortized.
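(As an aside, the amortization can be exploited directly by reusing a single encoder/decoder pair for all clones. An untested sketch with illustrative names; note it is not safe for concurrent use, since all clones share one stream:)

var (
    cloneBuf bytes.Buffer
    cloneEnc = gob.NewEncoder(&cloneBuf)
    cloneDec = gob.NewDecoder(&cloneBuf)
)

// CloneReusing transmits the type description only on the first
// Encode of each type, amortizing its cost across many clones.
func CloneReusing(a, b interface{}) error {
    if err := cloneEnc.Encode(a); err != nil {
        return err
    }
    return cloneDec.Decode(b)
}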
Testing / benching method
Now on to your testing method: this way of measuring performance is unreliable and can yield quite inaccurate results. Instead you should use Go's built-in testing and benchmarking tools. For details, read Order of the code and performance.
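For example, a minimal testing.B sketch of the same comparison (placed in a _test.go file alongside the Test, Clone, and DeepCopy definitions above, with "testing" imported; run with go test -bench=.):

func BenchmarkClone(b *testing.B) {
    r := Test{Prop1: 1, Prop2: "1"}
    for i := 0; i < b.N; i++ {
        var out Test
        Clone(r, &out)
    }
}

func BenchmarkDeepCopy(b *testing.B) {
    r := Test{Prop1: 1, Prop2: "1"}
    for i := 0; i < b.N; i++ {
        var out Test
        DeepCopy(&r, &out)
    }
}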
Caveats of these cloning methods
These methods work with reflection and thus can only "clone" fields that are accessible via reflection, that is: exported fields. They also don't preserve pointer equality: if you have 2 pointer fields in a struct, both pointing to the same object (pointers being equal), then after marshaling and unmarshaling you'll get 2 different pointers pointing to 2 different values. This may even cause problems in certain situations. Nor do they handle self-referencing structures, which at best return an error, and in the worst case cause an infinite loop or stack exhaustion.
The "proper" way of cloning
Considering the caveats mentioned above, the proper way of cloning often needs help from the "inside": cloning a specific type is often only possible if that type (or its package) provides the functionality itself.
Yes, providing "manual" cloning functionality is less convenient, but on the other hand it will outperform the above methods (possibly by orders of magnitude), and it requires the least amount of "working" memory for the cloning process.
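For the simple Test type above, "help from the inside" could be as small as this sketch (a plain value copy is already a deep copy here, since the struct holds no pointers):

func (t *Test) Clone() *Test {
    if t == nil {
        return nil
    }
    c := *t // copies Prop1 and Prop2; Go strings are immutable
    return &c
}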

Why does json.Encoder add an extra line?

json.Encoder seems to behave slightly differently from json.Marshal. Specifically, it adds a newline at the end of the encoded value. Any idea why that is? It looks like a bug to me.
package main
import "fmt"
import "encoding/json"
import "bytes"
func main() {
    var v string
    v = "hello"
    buf := bytes.NewBuffer(nil)
    json.NewEncoder(buf).Encode(v)
    b, _ := json.Marshal(&v)
    fmt.Printf("%q, %q", buf.Bytes(), b)
}
This outputs
"\"hello\"\n", "\"hello\""
Try it in the Playground
Because they explicitly add a newline character in Encoder.Encode. Here's the source code of that func, and the documentation actually states that it adds a newline character (see the doc comment):
https://golang.org/src/encoding/json/stream.go?s=4272:4319
// Encode writes the JSON encoding of v to the stream,
// followed by a newline character.
//
// See the documentation for Marshal for details about the
// conversion of Go values to JSON.
func (enc *Encoder) Encode(v interface{}) error {
    if enc.err != nil {
        return enc.err
    }
    e := newEncodeState()
    err := e.marshal(v)
    if err != nil {
        return err
    }
    // Terminate each value with a newline.
    // This makes the output look a little nicer
    // when debugging, and some kind of space
    // is required if the encoded value was a number,
    // so that the reader knows there aren't more
    // digits coming.
    e.WriteByte('\n')
    if _, err = enc.w.Write(e.Bytes()); err != nil {
        enc.err = err
    }
    encodeStatePool.Put(e)
    return err
}
Now, why did the Go developers do it, other than that it "makes the output look a little nicer"? One answer:
Streaming
The Go json Encoder is optimized for streaming (e.g. MB/GB/PB of JSON data). When streaming, you typically need a way to delimit where one value in the stream ends; in the case of Encoder.Encode(), that is a \n newline character. Sure, you can certainly write to a buffer, but you can also write to an io.Writer which would stream the blocks of v.
This is opposed to json.Marshal, which is generally discouraged if your input comes from an untrusted source of unknown size (e.g. an ajax POST to your web service - what if someone posts a 100MB json file?). Also, json.Marshal produces one final, complete piece of JSON - e.g. you wouldn't concatenate a few hundred Marshal results together. You'd use Encoder.Encode() for that: build a large set and write to a buffer, stream, file, io.Writer, etc.
Whenever I'm in doubt whether something is a bug, I look up the source - that's one of the advantages of Go: its source and compiler are just pure Go. Within [n]vim I use \gb to open the source definition in a browser with my .vimrc settings.
You can erase the newline by seeking backwards in the stream:
f, _ := os.OpenFile(fname, ...)
encoder := json.NewEncoder(f)
encoder.Encode(v)
f.Seek(-1, 1) // step back over the trailing '\n' (whence 1 = relative to current offset)
f.WriteString("other data ...")
They should let the user control this strange behavior, e.g. with:
a build option to disable it
Encoder.SetEOF(eof string)
Encoder.SetIndent(prefix, indent, eof string)
The Encoder writes a stream of documents. The extra whitespace terminates a JSON document in the stream.
A terminator is required for stream readers. Consider a stream containing these JSON documents: 1, 2, 3. Without the extra whitespace, the data on the wire is the sequence of bytes 123. This is a single JSON document with the number 123, not three documents.
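To illustrate, a small (untested) program decoding such a stream with json.Decoder; fed "123" instead, the loop would yield the single number 123:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "strings"
)

func main() {
    dec := json.NewDecoder(strings.NewReader("1\n2\n3\n"))
    for {
        var n int
        if err := dec.Decode(&n); err == io.EOF {
            break
        } else if err != nil {
            log.Fatal(err)
        }
        fmt.Println(n) // prints 1, 2, 3 on separate lines
    }
}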

How do I do this? Send a struct to a func as an interface and get it back as a struct?

I'm new to golang development and have some questions regarding something related to this question.
As a learning exercise, I'm trying to create a simple library to handle a JSON-based configuration file. As the configuration file will be used by more than one app, it should be able to handle different parameters. So I created a Configuration struct that has the filename and a data interface. Each app will have a struct based on its configuration needs.
In the code below, I put it all together (lib and "main code"); the TestData struct is the "app parameters".
If the file doesn't exist, it will set default values and create the file, and that is working. But when I try to read the file, I try to decode the JSON and put it back into the data interface, and it gives me an error that I couldn't figure out how to solve. Can someone help with this?
[updated] I didn't post the targeted code before, because I thought that it would be easier to read it all as a single program. Below is the 'targeted code' for a better view of the issue.
As I will not be able to use the TestData struct inside the library, since it will change from program to program, the only way I found to handle this was using an interface. Is there a better way?
library config
package config

import (
    "encoding/json"
    "fmt"
    "os"
)

// Base configuration struct
type Configuration struct {
    Filename string
    Data     interface{}
}

func (c *Configuration) Create(cData *Configuration) bool {
    cFile, err := os.Open(cData.Filename)
    defer cFile.Close()
    if err == nil {
        fmt.Println("Error(1) trying to create a configuration file. File '", cData.Filename, "' may already exist...")
        return false
    }
    cFile, err = os.Create(cData.Filename)
    if err != nil {
        fmt.Println("Error(2) trying to create a configuration file. File '", cData.Filename, "' may already exist...")
        return false
    }
    buffer, _ := json.MarshalIndent(cData.Data, "", "")
    cFile.Write(buffer)
    return true
}

func (c *Configuration) Read(cData *Configuration) bool {
    cFile, err := os.Open(cData.Filename)
    defer cFile.Close()
    if err != nil {
        fmt.Println("Error(1) trying to read a configuration file. File '", cData.Filename, "' may not already exist...")
        return false
    }
    jConfig := json.NewDecoder(cFile)
    jerr := jConfig.Decode(&cData.Data)
    if jerr != nil {
        panic(jerr)
    }
    return true
}
program using library config
package main

import (
    "fmt"

    "./config"
)

// basic configuration struct
type TestData struct {
    URL  string
    Port string
}

func main() {
    var Config config.Configuration
    Config.Filename = "config.json"

    if !Config.Read(&Config) {
        Config.Data = TestData{"http", "8080"}
        Config.Create(&Config)
    }
    fmt.Println(Config.Data)

    TestData1 := &TestData{}
    TestData1 = Config.Data.(*TestData) // error, why?
    fmt.Println(TestData1.URL)
}
NEW UPDATE:
I have made some changes after JimB's comment that I'm not clear on some concepts, and I tried to review them. Many things surely aren't clear to me yet, unfortunately. I believe I got the "big" understanding, but what messes my mind up is the "ins" and "outs" of values, formats and pointers, mainly when they pass through other libraries. I'm not yet able to follow their "full path".
Still, I believe my code has improved somewhat.
I think that I have corrected some points, but I still have some big questions:
I stopped sending "Configuration" as a parameter, as all the "data" is already there in the instance itself. Right?
Why do I have to use a reference in line 58 (Config.Data = &TestData{})?
Why do I have to use a pointer in line 64 (tmp := Config.Data.(*TestData))?
Why can I NOT use a reference in line 69 (Config.Data = tmp)?
Thanks
The reason you are running into an error is that you are trying to decode into an interface{} type. JSON objects are decoded by the encoding/json package into map[string]interface{} values by default. This makes the type assertion fail, since the memory layout of a map[string]interface{} is very different from that of a struct.
The better way to do this is to make your TestData struct the expected data format for your Configuration struct:
// Base configuration struct
type Configuration struct {
    Filename string
    Data     *TestData
}
Then, when decoding the file data, the package will unmarshal the data into the fields that match most closely.
If you need more control over the unmarshaling process, you can dictate which JSON fields get decoded into which struct members by using struct tags. You can read more about the available json struct tags here: https://golang.org/pkg/encoding/json/#Marshal
You are trying to assert that Config.Data is of type *TestData, but you assigned it the value TestData{"http", "8080"} above. You can take the address of a composite literal to create a pointer:
Config.Data = &TestData{"http", "8080"}
If your config already exists, your Read method is going to fill in the Data field with the default JSON data type, probably a map[string]interface{}. If you assign a pointer of the correct type to Data first, it will decode into the expected type:
Config.Data = &TestData{}
And since Data is an interface{}, you never want a pointer to that value, so don't use the & operator on it when marshaling and unmarshaling.
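Putting both answers together, a hedged sketch of the corrected main (library unchanged; values are the question's own):

var Config config.Configuration
Config.Filename = "config.json"
// Seed Data with a typed pointer so Decode fills in a *TestData
// instead of a map[string]interface{}.
Config.Data = &TestData{}
if !Config.Read(&Config) {
    Config.Data = &TestData{URL: "http", Port: "8080"}
    Config.Create(&Config)
}
td := Config.Data.(*TestData) // the assertion now succeeds
fmt.Println(td.URL)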

How do I substitute bytes in a stream using the Go standard library?

I have an io.Reader which I get from http.Request.Body, and which reads a JSON byte slice from a server.
I would like to stream this into json.NewDecoder. However, I would also like to intercept the JSON before it hits json.NewDecoder and substitute certain parts of it. For example, the JSON string contains empty hashes "{}" which I would like to remove, due to a bug in the server's JSON output.
I am currently achieving my goal using json.Unmarshal but not using the JSON streaming parser:
data, _ := ioutil.ReadAll(r.Body)
data = bytes.Replace(data, []byte("{}"), []byte(""), -1)
json.Unmarshal(data, [my struct])
How can I achieve the same thing as above, but using json.NewDecoder, so I can save the multiple passes the above code makes over r.Body's data? Here's some code using a pseudo-function ReplaceStream(r io.Reader, old, new []byte):
reader := ReplaceStream(r.Body, []byte("{}"), []byte(""))
dec := json.NewDecoder(reader)
dec.Decode([my struct])
I know ReplaceStream might be fairly trivial to make, but is there anything in the standard library to do this that I am unaware of?
My advice is to just treat that kind of message as a special case, and avoid the extra parsing / substituting for all the other requests:
data, _ := ioutil.ReadAll(r.Body)
// FIXME: work around bug #12312 of the json server
if string(data) == `{"list": [{}]}` {
    return nil
}
// normal datastruct handling ...
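That said, nothing in the standard library does this out of the box, and the pseudo ReplaceStream can be written against io.Reader directly. An untested sketch that holds back up to len(old)-1 bytes between reads, so matches spanning chunk boundaries are still caught (assumes len(old) > 0):

package replace

import (
    "bytes"
    "io"
)

// replacingReader rewrites old to new as bytes flow through.
type replacingReader struct {
    src     io.Reader
    old     []byte
    new     []byte
    carry   []byte // possible prefix of a match, held back
    pending []byte // transformed bytes not yet delivered
    err     error
}

// ReplaceStream wraps r so every occurrence of old becomes new.
func ReplaceStream(r io.Reader, old, new []byte) io.Reader {
    return &replacingReader{src: r, old: old, new: new}
}

func (rr *replacingReader) Read(p []byte) (int, error) {
    for len(rr.pending) == 0 {
        if rr.err != nil {
            return 0, rr.err
        }
        buf := make([]byte, 4096)
        n, err := rr.src.Read(buf)
        work := append(rr.carry, buf[:n]...)
        rr.carry = nil
        cut := len(work)
        if err == nil {
            // Hold back the longest suffix that could still be the
            // start of a match continuing in the next chunk.
            start := len(work) - len(rr.old) + 1
            if start < 0 {
                start = 0
            }
            for i := start; i < len(work); i++ {
                if bytes.HasPrefix(rr.old, work[i:]) {
                    cut = i
                    break
                }
            }
            rr.carry = append([]byte(nil), work[cut:]...)
        } else {
            rr.err = err // at EOF, flush everything including the carry
        }
        rr.pending = bytes.Replace(work[:cut], rr.old, rr.new, -1)
    }
    n := copy(p, rr.pending)
    rr.pending = rr.pending[n:]
    return n, nil
}

Usage would then match the question's pseudo-code:

dec := json.NewDecoder(ReplaceStream(r.Body, []byte("{}"), []byte("")))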

Golang: Parsing benchmarking between message pack and JSON

We are working on a TCP server which takes simple text-based commands over TCP (similar to redis).
We are tossing up between using raw text commands, JSON, or MessagePack (http://msgpack.org/).
An example of a command could be:
text command: LOCK some_random_key 1000
JSON command: {"command":"LOCK","key":"some_random_key","timeout":1000}
messagePack: \x83\xA7command\xA4LOCK\xA3key\xAFsome_random_key\xA7timeout\xCD\x03\xE8
Question:
EDIT: I have answered my own question, which was about the speed comparison between parsing JSON and MsgPack. Please see the results below.
Parsing Speed Comparison:
BenchmarkJSON 100000 17888 ns/op
BenchmarkMsgPack 200000 10432 ns/op
My benchmarking code:
package benchmark

import (
    "encoding/json"
    "testing"

    "github.com/vmihailenco/msgpack"
)

var in = map[string]interface{}{"c": "LOCK", "k": "31uEbMgunupShBVTewXjtqbBv5MndwfXhb", "T/O": 1000, "max": 200}

func BenchmarkJSON(b *testing.B) {
    for i := 0; i < b.N; i++ {
        jsonB := EncodeJSON(in)
        DecodeJSON(jsonB)
    }
}

func BenchmarkMsgPack(b *testing.B) {
    for i := 0; i < b.N; i++ {
        b := EncodeMsgPack(in)
        DecodeMsgPack(b)
    }
}

func EncodeMsgPack(message map[string]interface{}) []byte {
    b, _ := msgpack.Marshal(message)
    return b
}

func DecodeMsgPack(b []byte) (out map[string]interface{}) {
    _ = msgpack.Unmarshal(b, &out)
    return
}

func EncodeJSON(message map[string]interface{}) []byte {
    b, _ := json.Marshal(message)
    return b
}

func DecodeJSON(b []byte) (out map[string]interface{}) {
    _ = json.Unmarshal(b, &out)
    return
}
I would suggest doing some benchmarks on the kind of data the machines will actually be sending each other.
I would also suggest trying Protocol Buffers (encoding) + Snappy (compression).
msgpack only promises to be shorter than JSON, not faster to parse. In both cases, your test string is so short and simple that your benchmark may simply be testing the maturity of the particular implementations, rather than the underlying algorithms.
If all your messages really are this short, parsing speed may be the least of your problems. I'd suggest designing your server so that the parsing bits are easily replaceable, and actually profiling the code in action.
Donald Knuth made the following statement on optimization:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
Lastly if you really want to know what is going on, you need to profile the code. See
http://blog.golang.org/profiling-go-programs
for an example of how to profile code with go.
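The skeleton from that post, as a minimal sketch (the cpuprofile flag follows the post's convention; inspect the output with go tool pprof):

package main

import (
    "flag"
    "log"
    "os"
    "runtime/pprof"
)

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        pprof.StartCPUProfile(f)
        defer pprof.StopCPUProfile()
    }
    // ... run the encode/decode workload under test here ...
}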
Also, your test cases are reversed: BenchmarkJSON actually calls MsgPack, and BenchmarkMsgPack calls JSON. Could that have something to do with it?