Why does Go treat JSON as []byte instead of string? [closed]

RFC 7159 says
JavaScript Object Notation (JSON) is a text format for the serialization of structured data.
But Go treats JSON as []byte
func Marshal(v interface{}) ([]byte, error)
func Unmarshal(data []byte, v interface{}) error
Why don't these functions take and return a string?
I could not find any explanation here
https://golang.org/pkg/encoding/json/
https://blog.golang.org/json-and-go

Go does not go by "strings are for text, byte types are for other stuff" like some other languages (e.g. Python 3) do. "In Go, a string is in effect a read-only slice of bytes." The string type has a few behaviors attached that are handy for dealing with UTF-8 text, but it'll hold whatever bytes you put in it. Text-handling stuff in the standard library is often written to work with []bytes too, e.g. package bytes mirrors package strings and regexp deals in either.
Given that there's no rule about text/binary semantically belonging in one type or the other, the choice to use []byte was probably made for practical reasons. Since strings are read-only slices of bytes, almost all operations changing strings have to copy bytes to a new string instead of modifying the existing one. (String slicing is a key exception; it just makes a new string header that can point into the old string's bytes.)
Copying string contents for each operation leads to a quadratic slowdown as the string length and number of copies both grow with input size. On top of the direct cost of the copies, allocating the space for them makes garbage collection happen more often. For those reasons, almost everything that builds up content via a lot of small operations in Go uses a []byte internally. That includes Go's JSON-marshalling code, and the strings.Builder type added in Go 1.10.
(For similar reasons, Java and C# offer string-builder types as well and modern JavaScript VMs have clever tricks to defer copying bytes until after a long series of concat operations, such as V8's cons strings and SpiderMonkey's ropes.)
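To make the copying cost concrete, here is a minimal sketch (the helper names are mine, invented purely for illustration) contrasting naive string concatenation with strings.Builder, which accumulates into a []byte internally:

package main

import (
    "fmt"
    "strings"
)

// concatNaive copies the whole string built so far on every +=,
// so total work grows quadratically with the number of pieces.
func concatNaive(pieces []string) string {
    s := ""
    for _, p := range pieces {
        s += p
    }
    return s
}

// concatBuilder appends into strings.Builder's internal []byte and
// only materializes a string once, at the end.
func concatBuilder(pieces []string) string {
    var b strings.Builder
    for _, p := range pieces {
        b.WriteString(p)
    }
    return b.String()
}

func main() {
    pieces := []string{`{"key":`, `"value"`, `}`}
    fmt.Println(concatNaive(pieces) == concatBuilder(pieces)) // true
}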
Because []bytes are read-write and strings are read-only, converting one to the other also has to copy bytes. If Marshal returned a string, that would require making another copy of the content (and the associated load on the GC). Also, if you're ultimately going to do I/O with this, Write() takes a byte slice, so for that you'd have to convert back, creating another copy. (To slightly mitigate that, some I/O types including *os.File support WriteString() as well. But not all do!)
So it makes more sense for json.Marshal to return the []byte it built up internally; you can of course call string(bytes) on the result if you need a string and the copying isn't a problem.
A bit out of the original question's scope, but often the best performing option is just to stream the output directly to an io.Writer using a json.Encoder. You never have to allocate the whole chunk of output at once, and it can make your code simpler as well since there's no temp variable and you can handle marshalling and I/O errors in one place.
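For instance, a minimal sketch of the streaming approach (os.Stdout stands in here for whatever io.Writer you actually have):

package main

import (
    "encoding/json"
    "log"
    "os"
)

func main() {
    v := map[string]interface{}{"key": "value", "n": 42}

    // Encode streams the JSON straight into the writer, so our code
    // never builds an intermediate []byte or string for the whole
    // document, and encoding and I/O errors surface in one place.
    if err := json.NewEncoder(os.Stdout).Encode(v); err != nil {
        log.Fatal(err)
    }
}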

Related

Viewstate: 2 different formats?

Trying to scrape a webpage, I hit the necessity to work with ASP.NET's __VIEWSTATE variables. So, ever the optimist, I decided to read up on those variables, and their formats. Even though classified as Open Source by Microsoft, I couldn't find any formal definition:
Everybody agrees the first step to do is decode the string, using a Base64 decoder. Great - that works...
Next - and this is where the confusion sets in:
Roughly 3/4 of the decoders seem to use binary values (characters whose values indicate the type of field that follows). Here's an example of such a specification. This format also seems to expect a 'signature' of 0xFF 0x01 as the first two bytes.
The rest of the articles (such as this one) describe a format where the fields are separated (or marked) by t< ... >, p< ... >, etc. (this seems to be the case for the page I'm interested in).
Even after looking at over a hundred pages, I didn't find any mention about the existence of two formats.
My questions are: Are there two different formats of __VIEWSTATE variables in use, or am I missing something basic? Is there any formal description of the __VIEWSTATE contents somewhere?
The view state is serialized and deserialized by the System.Web.UI.LosFormatter class—the LOS stands for limited object serialization—and is designed to efficiently serialize certain types of objects into a base-64 encoded string. The LosFormatter can serialize any type of object that can be serialized by the BinaryFormatter class, but is built to efficiently serialize objects of the following types:
Strings
Integers
Booleans
Arrays
ArrayLists
Hashtables
Pairs
Triplets
Everything you need to know about ViewState: Understanding View State

Disadvantages of using JSON in a query string?

I want to use the query string as JSON, for example: /api/uri?{"key":"value"}, instead of /api/uri?key=value. Advantages, from my point of view, are:
JSON keeps the types of parameters, such as booleans, ints, floats and strings. A standard query string treats all parameters as strings.
JSON has fewer symbols for deeply nested structures. For example, ?{"a":{"b":{"c":[1,2,3]}}} vs ?a[b][c][]=1&a[b][c][]=2&a[b][c][]=3
It's easier to build on the client side.
What disadvantages could be in that case of json usage?
It looks good if it's
/api/uri?{"key":"value"}
as stated in your example, but since it's part of the URL it gets encoded to:
/api/uri?%7B%22key%22%3A%22value%22%7D
or something similar, which makes /api/uri?key=value simpler than the encoded form, both for reading and for debugging outbound calls (i.e. when you want to check the actual request via Wireshark, etc.). Also notice that it takes up more characters once encoded to a valid URL (see browser URL length limitations).
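To make the comparison concrete, here is a small sketch of mine (not from the answer) showing what each style looks like once Go's net/url does the encoding:

package main

import (
    "fmt"
    "net/url"
)

func main() {
    // JSON blob as the query string: it must be percent-encoded
    // to be a valid URL.
    j := `{"key":"value"}`
    fmt.Println("/api/uri?" + url.QueryEscape(j))
    // /api/uri?%7B%22key%22%3A%22value%22%7D

    // Conventional form: url.Values handles the encoding and the
    // result stays readable.
    q := url.Values{}
    q.Set("key", "value")
    fmt.Println("/api/uri?" + q.Encode())
    // /api/uri?key=value
}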
For the case of 'fewer symbols for nested structures', it might be more appropriate to create a new resource for your sub-resource, where you handle the filtering through your query parameters; e.g. from
api/a?{"b":1}
to
api/b?bfield1=1
api/a?aBfield1=1 // or something similar
Lastly, for 'easier to build on the client side', I think it depends on what you use to create your client; query params are usually represented as maps, so it is still simple.
Also if you need a collection for a single param then:
/uri/resource?param1=value1,value2,value3

Go JSON decoding is very slow. What would be a better way to do it?

I am using Go, Revel WAF and Redis.
I have to store large json data in Redis (maybe 20MB).
json.Unmarshal() takes roughly 5 seconds. What would be a better way to do it?
I tried JsonLib, encode/json, ffjson, megajson, but none of them were fast enough.
I thought about using groupcache, but the JSON is updated in real time.
This is the sample code:
package main

import (
    "log"

    "github.com/garyburd/redigo/redis"
    json "github.com/pquerna/ffjson/ffjson"
)

func main() {
    c, err := redis.Dial("tcp", ":6379")
    if err != nil {
        log.Fatal(err)
    }
    defer c.Close()
    pointTable, err := redis.String(c.Do("GET", "data"))
    if err != nil {
        log.Fatal(err)
    }
    var hashPoint map[string][]float64
    if err := json.Unmarshal([]byte(pointTable), &hashPoint); err != nil { // Problem!!!
        log.Fatal(err)
    }
}
Parsing large JSON data does seem to be slower than it should be. It would be worthwhile to pinpoint the cause and submit a patch to the Go authors.
In the meantime, if you can avoid JSON and use a binary format, you will not only avoid this issue; you will also gain the time your code is now spending parsing ASCII decimal representations of numbers into their binary IEEE 754 equivalents (and possibly introducing rounding errors while doing so.)
If both your sender and receiver are written in Go, I suggest using Go's binary format: gob.
Doing a quick test, generating a map with 2000 entries, each a slice with 1050 simple floats, gives me 20 MB of JSON, which takes 1.16 sec to parse on my machine.
For these quick benchmarks, I take the best of three runs, but I make sure to only measure the actual parsing time, with t0 := time.Now() before the Unmarshal call and printing time.Now().Sub(t0) after it.
Using GOB, the same map results in 18 MB of data, which takes 115 ms to parse: one tenth the time.
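Here is a rough sketch of that quick test; it is my reconstruction of the methodology described above, not the exact code used, and errors are deliberately ignored for brevity. The numbers will of course differ on your machine.

package main

import (
    "bytes"
    "encoding/gob"
    "encoding/json"
    "fmt"
    "time"
)

func main() {
    // A map roughly like the one described: 2000 entries,
    // each a slice of 1050 floats.
    data := make(map[string][]float64, 2000)
    for i := 0; i < 2000; i++ {
        s := make([]float64, 1050)
        for j := range s {
            s[j] = float64(i*j) / 3.0
        }
        data[fmt.Sprintf("key%d", i)] = s
    }

    // JSON: time only the decode.
    jsonBytes, _ := json.Marshal(data)
    var fromJSON map[string][]float64
    t0 := time.Now()
    json.Unmarshal(jsonBytes, &fromJSON)
    fmt.Println("json:", len(jsonBytes), "bytes,", time.Now().Sub(t0))

    // gob: time only the decode.
    var buf bytes.Buffer
    gob.NewEncoder(&buf).Encode(data)
    gobSize := buf.Len()
    var fromGob map[string][]float64
    t0 = time.Now()
    gob.NewDecoder(&buf).Decode(&fromGob)
    fmt.Println("gob: ", gobSize, "bytes,", time.Now().Sub(t0))
}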
Your results will vary depending on how many actual floats you have there. If your floats have a lot of significant digits, deserving their float64 representation, then 20 MB of JSON will contain far fewer than my two million floats. In that case the difference between JSON and GOB will be even starker.
BTW, this proves that the problem lies indeed in the JSON parser, not in the amount of data to parse, nor in the memory structures to create (because both tests are parsing ~20 MB of data and recreating the same slices of floats). Replacing all the floats with strings in the JSON gives me a parsing time of 1.02 sec, confirming that the conversion from string representation to binary floats does take some time (compared to just moving bytes around) but is not the main culprit.
If the sender and the parser are not both Go, or if you want to squeeze the performance even further than GOB, you should use your own customised binary format, either using Protocol Buffers or manually with "encoding/binary" and friends.
Try fastjson. It is optimized for speed and usually parses JSON much faster than the standard encoding/json. Additionally, fastjson doesn't need structs adhering to a JSON schema: a single parser may parse multiple JSON documents with distinct schemas.
Try https://github.com/json-iterator/go
I got a 2x decoding speedup compared to the official package, and an added benefit is that jsoniter's API is compatible with encoding/json.
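For example, a small drop-in sketch (the sample input is invented; jsoniter's ConfigCompatibleWithStandardLibrary mirrors encoding/json's behaviour):

package main

import (
    "fmt"

    jsoniter "github.com/json-iterator/go"
)

// jsoniter exposes an encoding/json-compatible API, so switching is
// mostly a matter of swapping the import.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

func main() {
    var hashPoint map[string][]float64
    if err := json.Unmarshal([]byte(`{"p":[1.5,2.5,3.5]}`), &hashPoint); err != nil {
        fmt.Println("decode error:", err)
        return
    }
    fmt.Println(hashPoint["p"])
}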

compact binary representation of json [closed]

Are there any compact binary representations of JSON out there? I know there is BSON, but even that webpage says "in many cases is not much more efficient than JSON. In some cases BSON uses even more space than JSON".
I'm looking for a format that's as compact as possible, preferably some kind of open standard?
You could take a look at the Universal Binary JSON specification. It won't be as compact as Smile because it doesn't do name references, but it is 100% compatible with JSON (whereas BSON and BJSON define data structures that don't exist in JSON, so there is no standard conversion to/from).
It is also (intentionally) criminally simple to read and write with a standard format of:
[type, 1-byte char]([length, 4-byte int32])([data])
So simple data types begin with an ASCII marker code like 'I' for a 32-bit int, 'T' for true, 'Z' for null, 'S' for string and so on.
The format is by design engineered to be fast-to-read as all data structures are prefixed with their size so there is no scanning for null-terminated sequences.
For example, reading a string that might be demarcated like this (the []-chars are just for illustration purposes, they are not written in the format)
[S][512][this is a really long 512-byte UTF-8 string....]
You would see the 'S', switch on it to process a string, see the 4-byte integer of 512 that follows it, and know that you can grab the next 512 bytes in one chunk and decode them back to a string.
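Purely as an illustration of that [type][length][data] idea (a sketch, not a complete or conformant UBJSON reader), reading one such string record in Go might look like this:

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "io"
)

// readString reads one string record laid out as described above:
// an 'S' marker byte, a 4-byte big-endian length, then that many
// bytes of UTF-8 data.
func readString(r io.Reader) (string, error) {
    var marker [1]byte
    if _, err := io.ReadFull(r, marker[:]); err != nil {
        return "", err
    }
    if marker[0] != 'S' {
        return "", fmt.Errorf("expected 'S' marker, got %q", marker[0])
    }
    var length int32
    if err := binary.Read(r, binary.BigEndian, &length); err != nil {
        return "", err
    }
    buf := make([]byte, length)
    if _, err := io.ReadFull(r, buf); err != nil {
        return "", err
    }
    return string(buf), nil
}

func main() {
    // Build a record by hand: 'S', length 5, then "hello".
    record := append([]byte{'S', 0, 0, 0, 5}, "hello"...)
    s, err := readString(bytes.NewReader(record))
    fmt.Println(s, err) // hello <nil>
}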
Similarly, numeric values are written out without a length value to be more compact, because their type (byte, int32, int64, double) already defines their length in bytes (1, 4, 8 and 8 respectively). There is also support for arbitrarily long numbers that is extremely portable, even on platforms that don't support them.
On average you should see a size reduction of roughly 30% with a well balanced JSON object (lots of mixed types). If you want to know exactly how certain structures compress or don't compress you can check the Size Requirements section to get an idea.
On the bright side, regardless of compression, the data will be written in a more optimized format and be faster to work with.
I checked in the core Input/OutputStream implementations for reading/writing the format on GitHub today. I'll check in the general reflection-based object mapping later this week.
You can just look at those two classes to see how to read and write the format, I think the core logic is something like 20 lines of code. The classes are longer because of abstractions to the methods and some structuring around checking the marker bytes to make sure the data file is a valid format; things like that.
If you have really specific questions like the endianness (Big) of the spec or numeric format for doubles (IEEE 754) all of that is covered in the spec doc or just ask me.
Hope that helps!
Yes: the Smile data format (see the Wikipedia entry). It has a public Java implementation, and a C version is in the works on GitHub (libsmile). It has the benefit of being reliably more compact than JSON while keeping a 100% compatible logical data model, so it is easy to convert back and forth with textual JSON.
For performance, you can see the jvm-serializers benchmark, where Smile competes well with other binary formats (Thrift, Avro, protobuf); size-wise it is not the most compact (since it does retain field names), but it does much better with data streams where names are repeated.
It is used by projects like Elasticsearch and (optionally) Solr, and Protostuff-rpc supports it, although it is not as widely used as, say, Thrift or protobuf.
EDIT (Dec 2011) -- there are now also libsmile bindings for PHP, Ruby and Python, so language support is improving. In addition there are measurements on data size; although for single-record data the alternatives (Avro, protobuf) are more compact, for data streams Smile is often more compact due to its key and String value back-reference option.
gzipping JSON data is going to get you good compression ratios with very little effort because of its universal support. Also, if you're in a browser environment, you may end up paying a greater byte cost in the size of the dependency from a new library than you would in actual payload savings.
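A minimal sketch of that approach in Go (the payload here is just an invented example with some repetition, which gzip likes):

package main

import (
    "bytes"
    "compress/gzip"
    "encoding/json"
    "fmt"
    "log"
)

func main() {
    payload := map[string]interface{}{
        "name":  "example",
        "tags":  []string{"alpha", "beta", "alpha", "beta", "alpha", "beta"},
        "count": 42,
    }

    raw, err := json.Marshal(payload)
    if err != nil {
        log.Fatal(err)
    }

    // Compress the JSON bytes with gzip.
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := zw.Write(raw); err != nil {
        log.Fatal(err)
    }
    if err := zw.Close(); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("json: %d bytes, gzipped: %d bytes\n", len(raw), buf.Len())
}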
If your data has additional constraints (such as lots of redundant field values), you may be able to optimize by looking at a different serialization protocol rather than sticking to JSON. Example: a column-based serialization such as Avro's upcoming columnar store may get you better ratios (for on-disk storage). If your payloads contain lots of constant values (such as columns that represent enums), a dictionary compression approach may be useful too.
Another alternative that should be considered these days is CBOR (RFC 7049), which has an explicitly JSON-compatible model with a lot of flexibility. It is both stable and meets your open-standard qualification, and has obviously had a lot of thought put into it.
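For instance, using one Go CBOR implementation (github.com/fxamacker/cbor is my choice here, not something named in the answer; any conforming CBOR library would do), the round trip is a one-liner each way:

package main

import (
    "fmt"
    "log"

    "github.com/fxamacker/cbor/v2"
)

func main() {
    in := map[string]interface{}{"key": "value", "n": 42}

    // Encode to CBOR; its data model maps cleanly onto JSON's.
    b, err := cbor.Marshal(in)
    if err != nil {
        log.Fatal(err)
    }

    // Decode back into a generic map.
    var out map[string]interface{}
    if err := cbor.Unmarshal(b, &out); err != nil {
        log.Fatal(err)
    }
    fmt.Println(len(b), "bytes:", out)
}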
Have you tried BJSON?
Try using js-inflate to make and unmake blobs.
https://github.com/augustl/js-inflate
It works well and I use it a lot.
You might also want to take a look at a library I wrote. It's called minijson, and it was designed for this very purpose.
It's Python:
https://github.com/Dronehub/minijson
Have you tried AVRO? Apache Avro
https://avro.apache.org/

convert html entities to unicode(utf-8) strings in c? [duplicate]

Possible Duplicate:
How to decode HTML Entities in C?
This question is very similar to that one, but I need to do the same thing in C, not Python. Here are some examples of what the function should do:
input     output
&lt;      <
&gt;      >
&auml;    ä
&szlig;   ß
The function should have the signature char *html2str(char *html) or similar. I'm not reading byte by byte from a stream.
Is there a library function I can use?
There isn't a standard library function to do the job. There must be a large number of implementations available in the Open Source world - just about any program that has to deal with HTML will have one.
There are two aspects to the problem:
Finding the HTML entities in the source string.
Inserting the appropriate replacement text in its place.
Since the shortest possible entity is '&x;' (but, AFAIK, they all use at least 2 characters between the ampersand and the semi-colon), you will always be shortening the string since the longest possible UTF-8 character representation is 4 bytes. Hence, it is possible to edit in situ safely.
There's an illustration of HTML entity decoding in 'The Practice of Programming' by Kernighan and Pike, though it is done somewhat 'in passing'. They use a tokenizer to recognize the entity, and a sorted table of entity names plus the replacement value so that they can use a binary search to identify the replacements. This is only needed for the named entities; numeric entities such as '&#223;' can be decoded algorithmically.
This sounds like a job for flex. Granted, flex is usually stream-based, but you can change that using the flex function yy_scan_string (or its relatives). For details, see The flex Manual: Scanning Strings.
Flex's basic Unicode support is pretty bad, but if you don't mind coding in the bytes by hand, it could be a workaround. There are probably other tools that can do what you want, as well.