Problem -
We store dynamic JSON documents in Cosmos DB. We want to reduce document size without losing the search capabilities provided by Cosmos DB.
Solution -
We are thinking about selective encoding of JSON key fields.
Why selective? Because the key space is not fixed. If we start encoding every kind of key, our encoding dictionary might take more than 2 MB of space (or we cannot control how much space it will take), which might create other problems.
For example –
Original Document -
"sample-key-name": {
  "sample-sub-key-one": "I am value one",
  "sample-sub-key-two": "I am value two",
  "sample-sub-key-three": "I am value three"
}
Document After Encoding –
"a": {
  "k1": "I am value one",
  "k2": "I am value two",
  "k3": "I am value three"
}
Dictionary -
{
  "a": "sample-key-name",
  "k1": "sample-sub-key-one",
  "k2": "sample-sub-key-two",
  "k3": "sample-sub-key-three"
}
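A minimal sketch of the selective encode/decode step (TypeScript; the dictionary contents and helper names are illustrative assumptions, not a finished design):
// Hypothetical helpers for selective key encoding/decoding.
// Only keys present in the dictionary are shortened; unknown keys pass
// through unchanged, which keeps the dictionary size bounded.
const keyToShort: Record<string, string> = {
  "sample-key-name": "a",
  "sample-sub-key-one": "k1",
  "sample-sub-key-two": "k2",
  "sample-sub-key-three": "k3",
};
const shortToKey = Object.fromEntries(
  Object.entries(keyToShort).map(([long, short]): [string, string] => [short, long])
);

function mapKeys(value: unknown, table: Record<string, string>): unknown {
  if (Array.isArray(value)) return value.map((v) => mapKeys(v, table));
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(
        ([k, v]): [string, unknown] => [
          table[k] ?? k, // selective: keep the original key if it is not mapped
          mapKeys(v, table),
        ]
      )
    );
  }
  return value; // primitive values are left untouched
}

const encode = (doc: object) => mapKeys(doc, keyToShort);
const decode = (doc: object) => mapKeys(doc, shortToKey);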
Why are we not encoding the complete JSON doc? Because we would lose the ability to search.
What else could be explored/tried to make this more elegant (considering the JSON doc is dynamic)?
In one similar question, migrating data to cheaper storage is advised; we have already implemented that. But on top of it we want to reduce Cosmos DB cost as well.
I have a products-like page where I'm using SSR with Next.js. Users can search, filter, paginate, etc.
It's not a huge number of items, so I decided to send them all as props to the page and do all that functionality on the client side only. Raw props data is now roughly ~350kb and gzipped is ~75kb.
I think this is worth it, because I can save a lot on database reads, and I didn't have to set up any search/cache server to implement the searching/filtering functionality. It's all done on the client, because it has all the items in memory.
The product object has a shape similar to this:
{
longPropertyName: value,
stringEnum: 'LONG_STRING_VALUE'
}
What I could do to optimize data traffic would be to shorten property names and refactor string enums into numeric enums, so it would be:
{
shortProp: value,
numericEnum: 1
}
This for sure would reduce the uncompressed data size from ~350kb to maybe ~250kb.
But I'm not sure it's worth doing, because I suspect the gzipped size would remain much the same: gzip should already be very good at compressing repetitive text content, like property names and string enum values that show up multiple times in the data.
Would I get a reduction on the gzip size by the same factor? Or will it be just the same value?
gzip will find repeated strings that are no farther than 32K bytes from each other. It will encode up to 258 bytes of match at a time as a very compact length and distance. That is well suited to your application, whether or not you tokenize the information. Tokenization will improve the compression, both by making the matches shorter, permitting more text per match, and by supporting more distant matches by bringing what was farther away into that 32K window.
You can also try more modern compressors such as zstd or lzma2 (xz), which support much farther distances and longer matches.
Just tested it locally and while I was able to remove 20kb from uncompressed data by replacing the string enums with numeric enums, the change in the compressed data was only 1.1kb.
It means that gzip is already doing 95% of the work for me.
Note: I tested the gzip size with:
gzip -c filename.min.js | wc -c
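The same comparison can also be scripted in Node (a rough sketch; the file names are placeholders):
// Compare raw vs gzipped sizes of the original and the tokenized props data.
// "props.json" and "props.min.json" are placeholder file names.
import { readFileSync } from "node:fs";
import { gzipSync } from "node:zlib";

for (const file of ["props.json", "props.min.json"]) {
  const data = readFileSync(file);
  console.log(file, "raw:", data.length, "gzipped:", gzipSync(data).length);
}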
I ended up writing a minification/expand logic for those objects. I'm doing SSR, so I minify on the server, and expand on the client. That reduced the uncompressed props data from 350kb to 168kb and the final compressed page size from 75kb to 45kb.
Simple logic like:
On server:
{ longProp: value } => { a: value }
{ stringEnum: STRING_VALUE } => { b: 5 }
On client:
{ a: value } => { longProp: value }
{ b: 5 } => { stringEnum: STRING_VALUE }
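A rough TypeScript sketch of that minify/expand pair (property names and enum values here are placeholders, not the real schema):
// Server side: shorten property names and turn string enums into numbers
// just before the props are serialized into the page.
const enumToNumber: Record<string, number> = { LONG_STRING_VALUE: 1, OTHER_VALUE: 2 };
const numberToEnum = Object.fromEntries(
  Object.entries(enumToNumber).map(([name, code]): [number, string] => [code, name])
);

function minify(item: { longPropertyName: unknown; stringEnum: string }) {
  return { a: item.longPropertyName, b: enumToNumber[item.stringEnum] };
}

// Client side: expand once after hydration, before the items are used.
function expand(item: { a: unknown; b: number }) {
  return { longPropertyName: item.a, stringEnum: numberToEnum[item.b] };
}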
I have been working through the Advent of Code problems in Perl 6 this year and was attempting to use a grammar to parse Day 3's input.
Given input in this form: #1 # 1,3: 4x4 and this grammar that I created:
grammar Claim {
token TOP {
'#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
}
token digits {
<digit>+
}
token id {
<digits>
}
token coordinates {
<digits> ',' <digits>
}
token dimensions {
<digits> 'x' <digits>
}
}
say Claim.parse('#1 # 1,3: 4x4');
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need e.g.
say $match<id>.hash<digits>.<digit>;
this seems a little messy, is there a better way?
For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.
Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:
/ ^
'#' $<id>=(\d+) ' '
'# ' $<x>=(\d+) ',' $<y>=(\d+)
': ' $<w>=(\d+) x $<d>=(\d+)
$
/;
say ~$<id x y w d>; # 1 1 3 4 4
(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)
With that out of the way, your question remains perfectly cromulent as it is, because it's important to know how P6 scales in this regard from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.
Digging less messily
say $match<id>.hash<digits>.<digit>; # [「1」]
this seems a little messy, is there a better way?
Your say includes unnecessary code and output nesting. You could just simplify to something like:
say ~$match<id> # 1
Digging a little deeper less messily
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse.
For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))
One way to write your say to get the y coordinate:
say ~$match<coordinates><digits>[1] # 3
If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:
token coordinates { (<digits>) ',' (<digits>) }
Now you've eliminated the need to mention <digits>:
say ~$match<coordinates>[1] # 3
You could also name the new parenthesized captures:
token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
say ~$match<coordinates><y> # 3
Pre-digging
I have to dig down through each grammar production to get the value I need
The above techniques still all dig down into the automatically generated parse tree which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. The above techniques just make the way you dig into it seem a little shallower.
Another step is to do the digging work as part of the parsing process so that the say is simple.
You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):
my $made;
grammar Claim {
token TOP {
'#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
{ $made = ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
}
...
Now you can write just:
say $made # 1 1 3 4 4
This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.
Pre-digging less messily
Inlining code is quick and dirty. So is using a variable.
The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. It can then be retrieved using the .made method. So instead of $made = you'd have:
{ make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
And now you can write:
say $match.made # 1 1 3 4 4
That's much tidier. But there's more.
A sparse subtree of a parse tree
.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )
In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).
This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.
The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.
In fact there's an alias for .made, namely .ast:
say $match.ast # 1 1 3 4 4
While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.
Making it all elegant
For maintainability and reusability you can, and typically should, avoid inserting code inline at the end of rules and instead use Action objects.
In summary
There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.
Add parentheses as I explained above, naming the capture that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.
Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.
One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.
You can refer to each of your named portions directly. So to get the coordinates you can access:
say $match.<coordinates>.<digits>
this will return the Array of digits matches. If you just want the values, the easiest way is probably:
say $match.<coordinates>.<digits>.map( *.Int) or say $match.<coordinates>.<digits>>>.Int or even say $match.<coordinates>.<digits>».Int
to cast them to Ints
For the id field it's even easier: you can just cast the <id> match to an Int:
say $match.<id>.Int
This is a multi-part question:
Given a REST API with URLs containing natural numbers as path segments, is the generally expected behavior that the number be interpreted as an index or a key?
When performing a PUT against a deep resource path, is the generally expected behavior that the path be interpreted as a declaration of state? Meaning that all non-existent resources along the path be created. Or should an error be returned if any resource along the path does not exist?
Expanding on question 2, if the path does exist, and the path defines a resource structure differing from that which is present, should the preexisting resources be overwritten, again as a declaration of state, or should an error be returned indicating a type mismatch?
For example, consider the endpoint:
domain.tld/datasource/foo/2/bar/1/baz
foo is a string, and identifies a top level resource.
2 could be interpreted as either an index or a key.
bar is a string, interpreted as a key.
1 could be interpreted as either an index or a key.
baz is a string, interpreted as a key, pointing to a leaf node.
In other words, the data residing at domain.tld/datasource under the identifier foo could be any of the following:
index based:
[
  null,
  null,
  {
    'bar': [
      null,
      {'baz': null}
    ]
  }
]
key based:
{
  '2': {
    'bar': {
      '1': {'baz': null}
    }
  }
}
both index and key based:
{
  '2': {
    'bar': [
      null,
      {'baz': null}
    ]
  }
}
Question 1
Should 2 and 1 be considered integers or strings? As this is potentially impossible to know, is there a standard for type annotation in REST URLs to address this case? Some solutions on the whiteboard so far are as follows, with the assertion that 2 is a key and 1 is an index:
domain.tld/datasource/foo/2:str/bar/1:int/baz
where :str indicates that the preceding value is a key
and :int indicates that the preceding value is an index
domain.tld/datasource/foo/2/bar/1/baz?types=ki
where k, being member 0 of types, maps to the first int-like segment, and indicates that the value is a key
and i, being member 1 of types, maps to the second int-like segment, and indicates that the value is an index
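A small sketch of how a server might interpret the first scheme (TypeScript; all names here are hypothetical):
// Parse path segments annotated as "<value>:str" (key) or "<value>:int" (index).
type Segment =
  | { kind: "key"; value: string }
  | { kind: "index"; value: number };

function parseSegment(raw: string): Segment {
  const [value, hint] = raw.split(":");
  if (hint === "int") return { kind: "index", value: Number(value) };
  // ":str" (or no annotation) is treated as an object key
  return { kind: "key", value };
}

// "foo/2:str/bar/1:int/baz"
//   -> key "foo", key "2", key "bar", index 1, key "baz"
const segments = "foo/2:str/bar/1:int/baz".split("/").map(parseSegment);
console.log(segments);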
Question 2
If none of the above data was present, should a PUT against this path create those resources or return an error? If an error is returned, should each resource at each level be created individually, requiring multiple PUTs against the path?
Question 3
If the data from the first illustration (index based) is already present should the data from the second illustration (key based) forcibly overwrite all data at all levels in the path or return an error indicating a type mismatch? The inference here being that again, multiple PUTs are required for any assignment that changes the type.
I'm probably over-complicating the issue or missing something basic but I haven't found much in the way of definitive guidance. I have complete control over the system and can enforce any rules I see fit. However, I'm interested in the experience, meaning interactions should be easy to reason about, logical, expected, deterministic, etc.
From my point of view, you should never ever make something like 'deep resources' when trying to be 'restful' or 'resty'; I really don't see the benefit. It just makes the system much harder to understand, to use, and to develop (e.g., see your questions :) ).
Why not keep it simple and have 'single' URLs for single resources? That way it is clear to the client what a PUT will do and what a DELETE will do.
So, just as an example, you could have the list resource endpoint domain.com/datasource, which returns a list of all foos registered. It returns a list of HREFs, like domain.com/foo/1, alongside some metadata. foo/1 could also include a list of bars, but again, they are not nested in the 'foo' URI; they are simple top-level resources, e.g. 'domain.com/bar/1'.
This way a client can easily delete, update, and create items. You can link them by setting the correct links in the entities.
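For example, a hypothetical foo representation would then just carry links instead of nested resources:
{
  "id": 1,
  "name": "some foo",
  "links": {
    "self": "https://domain.com/foo/1",
    "bars": ["https://domain.com/bar/1", "https://domain.com/bar/2"]
  }
}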
Regarding your questions 2 and 3: I think that totally depends on your system. If you see the link domain.com/datasource/foo/1/bar/2/baz as ONE big resource, meaning the response includes information not only about baz but also about bar, foo, and datasource, then yes, a PUT would 'recreate' (fully update) the resource. If that link "only" returns information about baz, a PUT would only fully update that resource.
I'm designing my back-end. I have a JSON array/queue/something that is continuously appended to, and I only need data that is at most 2 weeks old. I only want to delete from this "queue", not the container document. Can I use TTL for this, or does TTL only work for whole documents?
Is there a better way to do this? Should I store the entries in per-day or per-hour arrays as separate documents instead?
Running Couchbase 2.2.
TTL in Couchbase only applies to whole documents; it's not possible to expire a subset of a document. Like you said, you can always have separate documents with different expiry times, in which you have a type, a date, and then the array of data as an element.
Then using a view like so:
function (doc, meta) {
  if(meta.type == "json") {
    if(doc.type == "ordered_data") {
      if(doc.date) {
        emit(dateToArray(doc.date));
      }
    }
  }
}
This emits all the related data ordered by date (set the descending flag to true); it also allows your app to select specific dates by passing in one or more keys, i.e. selecting a date range of 2 days, 1 week, etc. When a document expires it is removed from the view the next time the view updates (the timing varies based on your stale parameters plus ops per second/time).
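As a rough illustration, the view could then be queried over Couchbase's view REST API like this (TypeScript; the host, bucket, and design-document names are placeholders):
// Fetch roughly the last two weeks of entries, newest first, from the view above.
// With descending=true the startkey is the newest date and endkey the oldest.
const params = new URLSearchParams({
  descending: "true",
  startkey: JSON.stringify([2014, 1, 20]),
  endkey: JSON.stringify([2014, 1, 6]),
});

const res = await fetch(
  `http://cb-host:8092/my_bucket/_design/ordered/_view/by_date?${params}`
);
const { rows } = await res.json();
console.log(rows.length, "rows in the window");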
Then you can do whatever joining or extra processing you need at the application layer. There are other options available but for me this would be the most sensible way to approach the problem, any problems just comment and we'll try again.
P.S. How big are your arrays going to become? If they are going to be very large then perhaps you'd need to look at a different tech or way to solve the problem.
I want to store a JSON payload into redis. There's really 2 ways I can do this:
One using simple string keys and values.
key:user, value:payload (the entire JSON blob which can be 100-200 KB)
SET user:1 payload
Using hashes
HSET user:1 username "someone"
HSET user:1 location "NY"
HSET user:1 bio "STRING WITH OVER 100 lines"
Keep in mind that if I use a hash, the value length isn't predictable. They're not all short such as the bio example above.
Which is more memory efficient? Using string keys and values, or using a hash?
This article can provide a lot of insight here: http://redis.io/topics/memory-optimization
There are many ways to store an array of Objects in Redis (spoiler: I like option 1 for most use cases):
Store the entire object as JSON-encoded string in a single key and keep track of all Objects using a set (or list, if more appropriate). For example:
INCR id:users
SET user:{id} '{"name":"Fred","age":25}'
SADD users {id}
Generally speaking, this is probably the best method in most cases. If there are a lot of fields in the Object, your Objects are not nested with other Objects, and you tend to only access a small subset of fields at a time, it might be better to go with option 2.
Advantages: considered a "good practice." Each Object is a full-blown Redis key. JSON parsing is fast, especially when you need to access many fields for this Object at once. Disadvantages: slower when you only need to access a single field.
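For reference, a minimal client-side sketch of option 1 (TypeScript with ioredis; the user shape is just an example):
import Redis from "ioredis";

const redis = new Redis();

// Create: allocate an id, store the JSON blob, and track the id in a set.
async function createUser(user: { name: string; age: number }): Promise<number> {
  const id = await redis.incr("id:users");
  await redis.set(`user:${id}`, JSON.stringify(user));
  await redis.sadd("users", id);
  return id;
}

// Read: one GET plus a JSON.parse gives you the whole object.
async function getUser(id: number) {
  const raw = await redis.get(`user:${id}`);
  return raw ? JSON.parse(raw) : null;
}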
Store each Object's properties in a Redis hash.
INCR id:users
HMSET user:{id} name "Fred" age 25
SADD users {id}
Advantages: considered a "good practice." Each Object is a full-blown Redis key. No need to parse JSON strings. Disadvantages: possibly slower when you need to access all/most of the fields in an Object. Also, nested Objects (Objects within Objects) cannot be easily stored.
Store each Object as a JSON string in a Redis hash.
INCR id:users
HMSET users {id} '{"name":"Fred","age":25}'
This allows you to consolidate a bit and only use two keys instead of lots of keys. The obvious disadvantage is that you can't set the TTL (and other stuff) on each user Object, since it is merely a field in the Redis hash and not a full-blown Redis key.
Advantages: JSON parsing is fast, especially when you need to access many fields for this Object at once. Less "polluting" of the main key namespace. Disadvantages: About same memory usage as #1 when you have a lot of Objects. Slower than #2 when you only need to access a single field. Probably not considered a "good practice."
Store each property of each Object in a dedicated key.
INCR id:users
SET user:{id}:name "Fred"
SET user:{id}:age 25
SADD users {id}
According to the article above, this option is almost never preferred (unless the property of the Object needs to have specific TTL or something).
Advantages: Object properties are full-blown Redis keys, which might not be overkill for your app. Disadvantages: slow, uses more memory, and not considered "best practice." Lots of polluting of the main key namespace.
Overall Summary
Option 4 is generally not preferred. Options 1 and 2 are very similar, and they are both pretty common. I prefer option 1 (generally speaking) because it allows you to store more complicated Objects (with multiple layers of nesting, etc.) Option 3 is used when you really care about not polluting the main key namespace (i.e. you don't want there to be a lot of keys in your database and you don't care about things like TTL, key sharding, or whatever).
If I got something wrong here, please consider leaving a comment and allowing me to revise the answer before downvoting. Thanks! :)
It depends on how you access the data:
Go for Option 1:
If you use most of the fields on most of your accesses.
If there is variance on possible keys
Go for Option 2:
If you use just single fields on most of your accesses.
If you always know which fields are available
P.S.: As a rule of thumb, go for the option which requires fewer queries for most of your use cases.
Some additions to the given set of answers:
First of all, if you are going to use a Redis hash efficiently, you must know
the maximum key count and the maximum value size; otherwise, if they break hash-max-ziplist-value or hash-max-ziplist-entries, Redis will convert the hash to practically ordinary key/value pairs under the hood (see hash-max-ziplist-value, hash-max-ziplist-entries). And breaking out of the hash optimization under the hood IS REALLY BAD, because each ordinary key/value pair inside Redis costs about +90 bytes per pair.
It means that if you start with option two and accidentally break out of hash-max-ziplist-value, you get +90 bytes for EACH ATTRIBUTE you have inside the user model! (Actually not +90 but +70; see the console output below.)
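These limits can be inspected and changed at runtime with plain Redis commands (the values below mirror the ones used in the session that follows):
CONFIG GET hash-max-ziplist-entries
CONFIG GET hash-max-ziplist-value
CONFIG SET hash-max-ziplist-entries 512
CONFIG SET hash-max-ziplist-value 64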
# you need me-redis and awesome-print gems to run exact code
redis = Redis.include(MeRedis).configure( hash_max_ziplist_value: 64, hash_max_ziplist_entries: 512 ).new
=> #<Redis client v4.0.1 for redis://127.0.0.1:6379/0>
> redis.flushdb
=> "OK"
> ap redis.info(:memory)
{
"used_memory" => "529512",
**"used_memory_human" => "517.10K"**,
....
}
=> nil
# me_set( 't:i' ... ) same as hset( 't:i/512', i % 512 ... )
# txt is some English fiction book around 56K in length,
# so we just take a random 63-character string from it
> redis.pipelined{ 10000.times{ |i| redis.me_set( "t:#{i}", txt[rand(50000), 63] ) } }; :done
=> :done
> ap redis.info(:memory)
{
"used_memory" => "1251944",
**"used_memory_human" => "1.19M"**, # ~ 72b per key/value
.....
}
> redis.flushdb
=> "OK"
# pushing only one value per hash of 512 over the ziplist limit costs as much as pushing them all over
> redis.pipelined{ 10000.times{ |i| redis.me_set( "t:#{i}", txt[rand(50000), i % 512 == 0 ? 65 : 63] ) } }; :done
> ap redis.info(:memory)
{
"used_memory" => "1876064",
"used_memory_human" => "1.79M", # ~ 134 bytes per pair
....
}
redis.pipelined{ 10000.times{ |i| redis.set( "t:#{i}", txt[rand(50000), 65] ) } };
ap redis.info(:memory)
{
"used_memory" => "2262312",
"used_memory_human" => "2.16M", #~155 byte per pair i.e. +90 bytes
....
}
Regarding TheHippo's answer: the comments on option one are misleading.
hgetall/hmset/hmget to the rescue if you need all fields or multiple get/set operations.
Regarding BMiner's answer:
The third option is actually really fun: for a dataset with max(id) < hash-max-ziplist-value, this solution has O(N) complexity because, surprise, Redis stores small hashes as an array-like container of length/key/value objects!
But many times hashes contain just a few fields. When hashes are small we can instead just encode them in an O(N) data structure, like a linear array with length-prefixed key value pairs. Since we do this only when N is small, the amortized time for HGET and HSET commands is still O(1): the hash will be converted into a real hash table as soon as the number of elements it contains will grow too much
But you should not worry: you'll break hash-max-ziplist-entries very fast, and there you go, you are now actually at solution number 1.
The second option will most likely turn into the fourth solution under the hood, because, as the question states:
Keep in mind that if I use a hash, the value length isn't predictable. They're not all short such as the bio example above.
And as you already said, the fourth solution is certainly the most expensive: +70 bytes per attribute.
My suggestion for optimizing such a dataset:
You've got two options:
If you cannot guarantee the max size of some user attributes, then go for the first solution, and if memory is crucial, compress the user JSON before storing it in Redis.
If you can enforce a max size for all attributes, then you can set hash-max-ziplist-entries/value and use hashes either as one hash per user representation OR as the hash memory optimization from this topic of the Redis guide: https://redis.io/topics/memory-optimization, storing the user as a JSON string. Either way you may also compress long user attributes.
We had a similar issue in our production environment; we came up with the idea of gzipping the payload if it exceeds some threshold in KB.
I have a repo dedicated solely to this Redis client lib here.
The basic idea is to detect whether the payload size is greater than some threshold; if so, gzip it and base64-encode it, then keep the compressed string as a normal string in Redis. On retrieval, detect whether the string is a valid base64 string, and if so, decompress it.
The whole compressing and decompressing is transparent, plus you save close to 50% of the network traffic.
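A minimal sketch of the idea (Node.js zlib; the 10 KB threshold and the gzip-magic-byte detection heuristic are illustrative assumptions, not the library's exact implementation):
import { gzipSync, gunzipSync } from "node:zlib";

const THRESHOLD_BYTES = 10 * 1024; // illustrative threshold

// Store path: gzip + base64 only when the payload is large enough.
function pack(payload: string): string {
  if (Buffer.byteLength(payload) < THRESHOLD_BYTES) return payload;
  return gzipSync(payload).toString("base64");
}

// Retrieval path: if the string base64-decodes to a gzip stream
// (magic bytes 0x1f 0x8b), decompress it; otherwise return it as-is.
function unpack(stored: string): string {
  const buf = Buffer.from(stored, "base64");
  if (buf[0] === 0x1f && buf[1] === 0x8b) {
    try {
      return gunzipSync(buf).toString("utf8");
    } catch {
      // not actually gzip data, fall through
    }
  }
  return stored;
}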
Compression Benchmark Results
BenchmarkDotNet=v0.12.1, OS=macOS 11.3 (20E232) [Darwin 20.4.0]
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.201
[Host] : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT DEBUG
| Method                      |       Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
| WithCompressionBenchmark    |   668.2 ms | 13.34 ms | 27.24 ms |     - |     - |     - |   4.88 MB |
| WithoutCompressionBenchmark | 1,387.1 ms | 26.92 ms | 37.74 ms |     - |     - |     - |   2.39 MB |
To store JSON in Redis you can use the Redis JSON module.
This gives you:
Full support for the JSON standard
A JSONPath syntax for selecting/updating elements inside documents
Documents stored as binary data in a tree structure, allowing fast access to sub-elements
Typed atomic operations for all JSON value types
https://redis.io/docs/stack/json/
https://developer.redis.com/howtos/redisjson/getting-started/
https://redis.com/blog/redisjson-public-preview-performance-benchmarking/
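For example, with RedisJSON's JSONPath syntax:
JSON.SET user:1 $ '{"name":"Fred","age":25,"address":{"city":"NY"}}'
JSON.GET user:1 $.address.city
JSON.NUMINCRBY user:1 $.age 1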
You can use the json module: https://redis.io/docs/stack/json/
It is fully supported and allows you to use JSON as a data structure in Redis.
There are also Redis Object Mappers for some languages: https://redis.io/docs/stack/get-started/tutorials/