Storage Optimisation: JSON vs String with delimiters

The JSON file below costs 163 bytes to store.
{
  "locations": [
    {
      "station": 6,
      "category": 1034,
      "type": 5
    },
    {
      "station": 3,
      "category": 1171,
      "type": 7
    }
  ]
}
But if the values are put together as a string with the delimiters ',' and '_', 6_1034_5,3_1171_7 costs only 17 bytes.
What are the problems with this approach?
Thank you.

The problems that I have seen with this sort of approach are mainly centered around maintainability.
With the delimited approach, the properties of your location items are identified by ordinal position. Since they are all numbers, there is nothing to tell you whether the first segment is the station, the category, or the type; you must know that in advance. Someone new to your code base may not know that and could therefore introduce bugs.
Right now all of your data are integers, which are relatively easy to encode and decode and do not risk conflicting with your delimiters. However, if you need to add user-supplied text at some point, you run the risk of that text containing your delimiters. In that case, you will have to invent an escaping/encoding mechanism to ensure that you can reliably detect your delimiters. This may seem simple, but it is more difficult than you may suspect. I've seen it done incorrectly many times.
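As a small illustration, here is a sketch in Python (with made-up values) of how a naive delimited encoding silently breaks once a value contains one of the delimiters:

# Naive delimited encoding: records separated by ',', fields by '_'.
def encode(locations):
    return ",".join("_".join(str(v) for v in loc) for loc in locations)

def decode(text):
    return [tuple(record.split("_")) for record in text.split(",")]

# Works fine while every value is a plain integer.
print(decode(encode([(6, 1034, 5), (3, 1171, 7)])))
# [('6', '1034', '5'), ('3', '1171', '7')]

# A single user-supplied value containing a delimiter corrupts the record
# boundaries, and nothing in the format tells you that it happened.
print(decode(encode([(6, "10,34", 5)])))
# [('6', '10'), ('34', '5')]  -- silently wrong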
Using a well-known structured text format like XML or JSON has the advantage that there are fully developed and tested rules for dealing with all types of text, and fully developed and tested libraries for reading and writing it.
Depending on your circumstances, this concern over the amount of storage could be a micro-optimization. You might want to try some capacity calculations (e.g., how much actual storage is required for X items) and compare that to the expected number of items vs. the expected amount of storage that will be available.
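As a rough sketch of such a capacity calculation (the per-item sizes below are assumptions for illustration, not measurements):

# Back-of-the-envelope check: does the JSON overhead actually matter?
items = 1_000_000
bytes_per_item_json = 80        # assumed average size of one JSON location object
bytes_per_item_delimited = 10   # assumed average size of one delimited record

print(f"JSON:      {items * bytes_per_item_json / 1024**2:.1f} MiB")       # ~76 MiB
print(f"Delimited: {items * bytes_per_item_delimited / 1024**2:.1f} MiB")  # ~10 MiB
# If even the larger figure is negligible next to the storage you expect to
# have available, the delimited format is a micro-optimization.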

What are some ways to restrict the decimal places in JSON Schema?

I am trying to store decimal numbers with a restricted number of decimal places in my JSON data, and initially I wanted to do it using strings. However, the schema does not support this. So as of right now, I am restricted to using this:
{"type": "number", "multipleOf" : 0.1} <- 1 decimal place
{"type": "number", "multipleOf" : 0.01} <- 2 decimal places
This works fine in dev, but I know from first-hand experience how quickly floats can break down in actual applications. So my first choice is still to find some way to store them as strings in my data. Is this possible with the current implementation of JSON Schema?
This is not something that is possible with JSON Schema for numbers.
If you can represent your number as a string, you can use regex in the JSON Schema to check this sort of thing.
Look up the pattern keyword.
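For instance, a sketch along these lines (using the third-party Python jsonschema package; the schema and pattern are just an illustration) limits a string-encoded number to at most two decimal places:

from jsonschema import ValidationError, validate

# Illustrative schema: a decimal encoded as a string, at most 2 decimal places.
schema = {"type": "string", "pattern": r"^-?\d+\.\d{1,2}$"}

for value in ["12.34", "12.345", 12.34]:
    try:
        validate(instance=value, schema=schema)
        print(f"{value!r}: valid")
    except ValidationError:
        print(f"{value!r}: invalid")
# '12.34': valid
# '12.345': invalid   (too many decimal places)
# 12.34: invalid      (a JSON number, not a string)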
As per the previous answer, if you are happy to represent the number as a string, you can use a regex pattern.
The schema below will restrict a number to 15 significant figures (potentially useful if you are concerned about what floating point can represent exactly):
{
  "type": "string",
  "pattern": "^(?!(?:.*?[1-9]){15,})([-+]?\\s*\\d+\\.?\\d*?)$"
}
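A quick way to sanity-check that pattern outside of a schema validator is to try it with Python's standard re module (a small sketch; the test values are arbitrary):

import re

# The same pattern as in the schema above, with the JSON string escaping removed.
PATTERN = re.compile(r"^(?!(?:.*?[1-9]){15,})([-+]?\s*\d+\.?\d*?)$")

for value in ["123.456", "-0.000123", "1234567891234567", "not a number"]:
    ok = PATTERN.match(value) is not None
    print(f"{value!r}: {'accepted' if ok else 'rejected'}")
# '123.456': accepted
# '-0.000123': accepted          (only non-zero digits count towards the limit)
# '1234567891234567': rejected   (16 non-zero digits)
# 'not a number': rejected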

What's the delimited_payload_string field type in Solr?

I'm new to Solr. Can someone explain what the delimited_payload_string field type is?
For example, can I store the following JSON object as a multi-valued delimited_payload_string?
"classifications": [
{"code": "RESTAURANT", "names": [{"nameLocale": "en-US", "name": "restaurant"}, {"nameLocale": "en-US", "name": "fast food"}]}
]
A delimited payload string is for attaching a payload to a specific field. A payload is an additional value that isn't visible in the document, but can be used by custom plugins. From Solr Payloads:
Available alongside the positionally related information is an optional general purpose byte array. At the lowest-level, Lucene allows any term in any position to store whatever bytes it’d like in its payload area. This byte array can be retrieved as the term’s position is accessed.
[...]
A payload’s primary use case is to affect relevancy scoring; there are also other very interesting ways to use payloads, discussed here later. Built-in at Lucene’s core scoring mechanism is float Similarity#computePayloadFactor() which until now has not been used by any production code in Lucene or Solr; though to be sure, it has been exercised extensively within Lucene’s test suite since inception. It’s hardy, just under-utilized outside custom expert-level coding to ensure index-time payloads are encoded the same way they are decoded at query time, and to hook this mechanism into scoring.
In your case you probably want to flatten the document before indexing and index the values as separate fields, depending on what you want to do with the data.
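A minimal sketch of that flattening step in Python (the Solr field names are hypothetical, and this only prepares the values; it does not talk to Solr):

# Flatten the nested "classifications" structure into simple multi-valued fields.
doc = {
    "classifications": [
        {"code": "RESTAURANT",
         "names": [{"nameLocale": "en-US", "name": "restaurant"},
                   {"nameLocale": "en-US", "name": "fast food"}]}
    ]
}

solr_doc = {
    "classification_code": [c["code"] for c in doc["classifications"]],
    "classification_name": [n["name"]
                            for c in doc["classifications"]
                            for n in c["names"]],
}
print(solr_doc)
# {'classification_code': ['RESTAURANT'],
#  'classification_name': ['restaurant', 'fast food']}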

After copying an 18 GB CSV file from Data Lake to DocumentDB, why does DocumentDB show 100 GB?

I have copied an 18 GB CSV file from Data Lake Store to DocumentDB using the copy activity of Azure Data Factory. It is one month of data in total, and I copied 5 days of data at a time with the ADF copy activity. After loading 25 days of data I get the error "Storage quota for 'Document' exceeded", and DocumentDB shows the size of that collection as 100 GB. I do not understand how 18 GB of data becomes 100 GB in DocumentDB. I have a partition key in DocumentDB and the default indexing policy. I know that indexing will increase the size a little, but I was not expecting this much. I am not sure whether I am doing anything wrong here. I do not have much experience with DocumentDB, and searching did not turn up an answer, so I am posting the question here.
I also tried copying a smaller set of 1.8 GB of data from Data Lake Store to DocumentDB in another collection, and it shows a size of around 14 GB in DocumentDB.
So DocumentDB holds more data than the actual data. Please help me understand why it shows almost 5 to 7 times the size in DocumentDB compared to the actual size in Data Lake Store.
Based on my experience, the index does occupy space, but the main reason for this issue is that the data is stored as JSON in DocumentDB.
{
  "color": "white",
  "name": "orange",
  "count": 1,
  "id": "fruit1",
  "arr": [1, 2, 3, 4],
  "_rid": "F0APAPzLigUBAAAAAAAAAA==",
  "_self": "dbs/F0APAA==/colls/F0APAPzLigU=/docs/F0APAPzLigUBAAAAAAAAAA==/",
  "_etag": "\"06001f2f-0000-0000-0000-5989c6da0000\"",
  "_attachments": "attachments/",
  "_ts": 1502201562
}
If you look at the JSON data, you can see that everything is stored as key-value pairs, because JSON is schema-less. These keys occupy space (one byte per character).
JSON also contains characters that keep it human readable, such as [ ], { }, : and so on. These special characters occupy space as well.
In addition, DocumentDB generates system properties that occupy space, such as _rid, _self, _etag and _ts. You can refer to the official documentation.
If possible, shorter keys can effectively save space, for example using n1 instead of name1.
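As a rough illustration of how much of each stored document is keys, punctuation and system properties rather than the original CSV values (plain Python; the sizes are only indicative):

import json

# One row of the original CSV vs. the same record stored as a JSON document
# (system properties such as _rid, _self, _etag, _ts use made-up values).
csv_row = "white,orange,1,fruit1"

json_doc = {
    "color": "white", "name": "orange", "count": 1, "id": "fruit1",
    "_rid": "F0APAPzLigUBAAAAAAAAAA==",
    "_self": "dbs/F0APAA==/colls/F0APAPzLigU=/docs/F0APAPzLigUBAAAAAAAAAA==/",
    "_etag": '"06001f2f-0000-0000-0000-5989c6da0000"',
    "_ts": 1502201562,
}

print(len(csv_row))               # 21 bytes of actual values in the CSV
print(len(json.dumps(json_doc)))  # several times larger once keys, punctuation
                                  # and system properties are included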
Hope it helps you.
This is a common "problem" with hierarchical, self-describing formats such as XML, JSON, YAML, etc.
First, if you take a "relational" format with a fixed schema, or a format that has no metadata such as CSV, and represent it in JSON, you explode the schema information into every single key/value property, as Jay explains.
Additionally, when you then store that document, the so-called document object model used to store it often inflates the original textual size by 2 to 10 times (depending on the length of the keys, the complexity of the documents, etc.).
Thus the recommendation is that unless you really need the semi-structured format provided by XML, JSON, etc., you should consider reverting the storage back to a structured format such as a table.

Schemaless Support for Elastic Search Queries

Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.
Consider this example document:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.
What is the best way to support search for this?
We thought a solution would be to simply not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked and there were no performance problems with the approach. However, after running multiple tests to that end, we haven't been able to get it to work.
Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.
Since this is not currently working for us, we’ve thought of a couple alternative solutions:
Reindexing: this would be costly as we would need to reindex every index that contains that document and do so every time a user updates a property with a different value type. Really bad for performance, so this is likely not a real option.
Use multi-match query: we would do this by appending a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
This means ES would create a new mapping for each 'random' field, and we would use a phrase multi-match query with a "starts with" wildcard for the field names when performing the queries. For example:
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
  "query": {
    "multi_match": {
      "query": "red",
      "type": "phrase",
      "fields": ["customData_*.favoriteColor"]
    }
  }
}'
This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?
This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
Any suggestions about any of this would be much appreciated.
Thanks!
You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers field type primitives based upon the first value it sees for that field. Therefore your non-deterministic customData object can get you in trouble if you first see "favoriteColor": 10 followed by "favoriteColor": "red".
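A quick way to see that conflict (a sketch using the Python requests library against a hypothetical local test cluster; the index name and URLs assume an ES 7.x-style API):

import requests

ES = "http://localhost:9200"   # hypothetical local test cluster

# First document seen: favoriteColor is a number, so dynamic mapping
# infers a numeric type for customData.favoriteColor.
requests.put(f"{ES}/test/_doc/1", json={"customData": {"favoriteColor": 10}})

# Second document: the same field now holds a string. This is typically
# rejected (HTTP 400, mapper_parsing_exception) because the dynamically
# created mapping has already fixed the field type.
r = requests.put(f"{ES}/test/_doc/2", json={"customData": {"favoriteColor": "red"}})
print(r.status_code, r.json().get("error", {}).get("type"))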
For your requirements, you should take a look at SIREn Solutions Elasticsearch plugin which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.
Fields with the same mapping are stored as the same Lucene field in the Lucene index (the Elasticsearch shard). Each distinct Lucene field has its own inverted index (term dictionary and index entries) and its own doc values. Lucene is highly optimized to store values of the same field in a compressed way; a mapping that creates a different field for every document prevents Lucene from doing that optimization.
You should use Elasticsearch nested documents to search efficiently. The underlying technology is Lucene's BlockJoin, which indexes parent/child documents as a document block.
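A minimal sketch of the nested-document approach (again using requests and ES 7.x-style request bodies; the index name is made up, and storing customData as generic {key, value} pairs is an assumption on my part, though it is a common way to keep the mapping stable for arbitrary user fields):

import requests

ES = "http://localhost:9200"   # hypothetical local test cluster

# Map customData as an array of nested {key, value} pairs so that arbitrary
# user-defined fields do not each add a new entry to the mapping.
mapping = {
    "mappings": {
        "properties": {
            "customData": {
                "type": "nested",
                "properties": {
                    "key": {"type": "keyword"},
                    "value": {"type": "keyword"},
                },
            }
        }
    }
}
requests.put(f"{ES}/resources", json=mapping)

requests.put(f"{ES}/resources/_doc/1", json={
    "username": "joe",
    "customData": [{"key": "favoriteColor", "value": "red"}],
})

# Nested query: documents whose customData contains favoriteColor == red.
query = {
    "query": {
        "nested": {
            "path": "customData",
            "query": {"bool": {"must": [
                {"term": {"customData.key": "favoriteColor"}},
                {"term": {"customData.value": "red"}},
            ]}}
        }
    }
}
print(requests.post(f"{ES}/resources/_search", json=query).json())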

Explaining JSON (structure) to a business user

Suppose you have some data you want business users to contribute to, which will end up being represented as JSON. The data represents a piece of business logic your program knows how to handle.
As expected, the JSON has nested sections, the data has categorizations, some custom rules may optionally be introduced, etc.
It so happens that you already have a vision of what a "perfect" JSON should look like. That JSON is your starting point.
Question:
Is there a way one can take a (reasonably complex) JSON and present it in a (non-JSON) format, that would be easy for a non-technical person to understand?
If possible, could you provide an example?
What do you think of this?
http://www.codeproject.com/script/Articles/ArticleVersion.aspx?aid=90357&av=126401
Or, make your own using Ext JS for the visualization part. After all, JSON is a lingua franca on the web these days.
Apart from that, you could use XML instead of JSON, given that there are more "wizard" type tools for XML.
And finally, if when you say "business users" you mean "people who are going to laugh at you when you show them code," you should stop thinking about this as "How do I make people in suits edit JSON" and start thinking about it as "How do I make a GUI that makes sense to people, and I'll make it spit out JSON later."
Show the data as key/value pairs. If a value has sub-sections, show them as drill-downs in a tree structure. An HTML mockup which parses a JSON object from your system would help with understanding.
I picked this example from the JSON site:
{
  "name": "Jack (\"Bee\") Nimble",
  "format": {
    "type": "rect",
    "width": 1920,
    "height": 1080,
    "interlace": false,
    "frame rate": 24
  }
}
name and format would be the tree nodes.
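A tiny sketch of that drill-down rendering in Python, printing an indented tree instead of an HTML mockup:

import json

def print_tree(node, indent=0):
    # Render a parsed JSON value as an indented key/value tree.
    pad = "  " * indent
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (dict, list)):
                print(f"{pad}{key}")
                print_tree(value, indent + 1)
            else:
                print(f"{pad}{key}: {value}")
    elif isinstance(node, list):
        for item in node:
            print_tree(item, indent)
    else:
        print(f"{pad}{node}")

print_tree(json.loads('''
{
  "name": "Jack (\\"Bee\\") Nimble",
  "format": {"type": "rect", "width": 1920, "height": 1080,
             "interlace": false, "frame rate": 24}
}
'''))
# name: Jack ("Bee") Nimble
# format
#   type: rect
#   width: 1920
#   height: 1080
#   interlace: False
#   frame rate: 24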