MySQL: Embedded JSON vs table

I'm designing the database schema for a video production project management app and struggling with how to persist some embedded, but not repeatable data. In the few CS courses I took, part of normalizing a relational database was identifying repeatable blocks and encapsulating them into their own table. What if I have a block of embedded/nested data that I know is likely to be unique to the record?
Example: A video record has many shoot_locations. Those locations will most likely never be repeated. Each shoot_location can also contain multiple shoot_times. Represented in JSON, this might look like:
{
  video: {
    shoot_locations: [
      {
        name: "Bob's Pony Shack",
        address: "99 Horseman Street, Anywhere, US 12345",
        shoot_times: {
          shoot_at: "2015-08-15 21:00:00",
          ...
        }
      },
      {
        name: "Jerry's Tackle",
        address: "15 Pike Place, Anywhere, US 12345",
        shoot_times: {
          shoot_at: "2015-08-16 21:00:00",
          ...
        }
      }
    ],
    ...
  }
}
Options...
store the shoot_locations in a JSON field (available in MySQL 5.7.8?)
create a separate table for the data.
something else?
I get the sense I should split embedded data into its own tables and save JSON for non-crucial metadata.
Summary
What's the best option to store non-repeating embedded data?

One reason for normalizing a database is to reduce redundancy (your "repeatable blocks").
Another reason is to allow "backwards" querying. If you wanted to know which video was shot at "15 Pike Place", your JSON solution would fail: you would have to resort to reading every row sequentially and decoding its JSON, which defeats the purpose of an RDBMS.
Good rules of thumb:
Structured data - put in tables and columns
Data that might be part of query conditions - put in tables and columns
Unstructured data you know you'll never query by - put into BLOBs, XML or JSON fields
If in doubt, use tables and columns. You might have to spend some extra time initially, but you will never regret it. People have regretted their choice for JSON fields (or XML, for that matter) again and again and again. Did I mention "again"?
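To make the "backwards" query concrete, here is a minimal sketch of the tables-and-columns option. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names, column names, and the video title "Pony promo" are assumptions made for illustration, not part of the question's schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with one table per nested level."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE videos (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL
        );
        CREATE TABLE shoot_locations (
            id INTEGER PRIMARY KEY,
            video_id INTEGER NOT NULL REFERENCES videos(id),
            name TEXT NOT NULL,
            address TEXT NOT NULL
        );
        CREATE TABLE shoot_times (
            id INTEGER PRIMARY KEY,
            shoot_location_id INTEGER NOT NULL REFERENCES shoot_locations(id),
            shoot_at TEXT NOT NULL
        );
    """)
    # Hypothetical sample rows mirroring the JSON in the question.
    conn.execute("INSERT INTO videos (id, title) VALUES (1, 'Pony promo')")
    conn.executemany(
        "INSERT INTO shoot_locations (id, video_id, name, address) VALUES (?, ?, ?, ?)",
        [
            (1, 1, "Bob's Pony Shack", "99 Horseman Street, Anywhere, US 12345"),
            (2, 1, "Jerry's Tackle", "15 Pike Place, Anywhere, US 12345"),
        ],
    )
    conn.executemany(
        "INSERT INTO shoot_times (shoot_location_id, shoot_at) VALUES (?, ?)",
        [(1, "2015-08-15 21:00:00"), (2, "2015-08-16 21:00:00")],
    )
    return conn

def videos_shot_at(conn, address_fragment):
    """The 'backwards' query: go from an address back to the videos."""
    rows = conn.execute(
        """
        SELECT DISTINCT v.title
        FROM videos AS v
        JOIN shoot_locations AS l ON l.video_id = v.id
        WHERE l.address LIKE ?
        ORDER BY v.title
        """,
        (f"%{address_fragment}%",),
    ).fetchall()
    return [title for (title,) in rows]
```

Calling videos_shot_at(build_demo_db(), "15 Pike Place") answers the question with a join rather than by decoding a JSON blob per row.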

Related

Loading Raw JSON Into Delta Lake (Like in Snowflake)

I am testing Delta Lake for a simple use case that is very easy in Snowflake, but I'm having a heck of a time understanding if it can be done, much less actually doing it.
I want to be able to load a JSON file "raw," without specifying a schema, and I want to be able to query and flatten it later. In Snowflake, I can create a column of type VARIANT and load the JSON text there, and later I can ask for the different parts by using :: and lateral flatten, etc.
The examples I've seen so far about Delta Lake have had "schema inference" or "autoloading" stipulations, and with those it seems that even if I don't specify a schema, one is created for me and then I still have to guess (or look up) what columns Delta Lake created for me so I can query those parts of the JSON. It seems a little too complicated.
This page has the following comment:
When ingesting data, you may need to keep it in a JSON string, and some data may not be in the correct data type.
... but it provides no example of how to do that. To me this suggests that you can somehow store the raw JSON and query it later, but I don't know how. Just make a STRING column and insert the JSON as string? Can someone post an example?
Am I trialing the wrong tool for what I need, or am I missing something? Thank you for your help.
As far as I'm aware, there is no direct equivalent to the VARIANT column in Snowflake. What that page is suggesting is storing the data as a string, and then using the semi-structured access operators to parse it as JSON on the fly.
For example, given a table named devices with a column named specifications of type string with the value
"""{
"device": "potato phone",
"sku": "POTATO0001"
}"""
Then you can query it like this:
SELECT specifications:device, specifications:sku from devices
edit: to address some of your other questions
This doesn't do schema enforcement. It's possible to create a Struct column in Delta Lake that can store structured data, but all the data in that column needs to be compatible with the Struct schema. If you are querying a JSON string column, you are on your own for schema management.
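The "store a string, parse on read" idea can be illustrated without any Delta Lake API at all. The following is a plain-Python analogy, not Delta Lake or Spark code: the list of dicts stands in for the devices table, the extract helper is hypothetical, and the second row ("carrot tablet") is invented to show that nothing enforces a schema.

```python
import json

# Stand-in for a table whose STRING column holds raw JSON text.
devices = [
    {"specifications": '{"device": "potato phone", "sku": "POTATO0001"}'},
    {"specifications": '{"device": "carrot tablet"}'},  # no sku: no schema is enforced
]

def extract(raw_json, *path):
    """Parse the stored string at read time and walk a key path; None if absent."""
    value = json.loads(raw_json)
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def select_device_and_sku(rows):
    """Rough analogue of: SELECT specifications:device, specifications:sku."""
    return [
        (extract(row["specifications"], "device"), extract(row["specifications"], "sku"))
        for row in rows
    ]
```

The missing sku simply comes back as None, which is the "you are on your own for schema management" trade-off in miniature.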

How to store large JSON documents (>20MB) in MongoDB without using GridFS

I want to store a large document in MongoDB, however, these are the two ways I will interact with the document:
I do frequent reads of that data and need to get a part of that data using aggregations
When I need to write to the document, I will be building it from scratch again, i.e., removing the existing document and inserting a new one.
Here is what a sample document looks like:
{
"objects_1": [
{
}
],
"objects_2": [
{
}
],
"objects_3": [
{
}
],
"policy_1": [
{
}
],
"policy_2": [
{
}
],
"policy_3": [
{
}
]
}
Here is how I want to access that data:
{
"objects_1": [
{
}
]
}
If I was storing it in a conventional way, I would write a query like this:
db.getCollection('configuration').aggregate([
{ $match: { _id: "FAAAAAAAAAAAA" } },
{ $project: {
"_id": 0,
"a_objects": {
$filter: {
input: "$settings.a_objects",
as: "arrayItem",
cond: { $eq: [ "$$arrayItem.name", "objectName" ] }
}
}
}}
])
However, since the document is larger than 16 MB, we can't save it directly to MongoDB. The size can be up to 50 MB.
Solutions I thought of:
I thought of storing the JSON data in GridFS and reading it as per the docs here: https://docs.mongodb.com/manual/core/gridfs/ . However, I would then need to read the entire file every time I want to look up a single object inside the large JSON blob. I need to do such reads frequently, on multiple large documents, which would lead to high memory usage.
I thought of splitting the JSON into parts and storing each object in its own separate collection, so that when I need to fetch the entire document, I can reassemble the JSON.
How should I approach this problem? Is there something obvious that I am missing here?
I think your problem is that you're not using the right tools for the job, or not using the tools you have in the way they were meant to be used.
If you want to persist large objects as JSON then I'd argue that a database isn't a natural choice for that - especially if the objects are large. I'd be looking at storage systems designed to do that well (say if your solution is on Azure/AWS/GCP see what specialist service they offer) or even just the file system if you run on a local server.
There's no reason why you can't have the JSON in a file and related data in a database - yes there are issues with that but the limitations of MongoDB won't be one of them.
I do frequent reads of that data and need to get a part of that data using aggregations
If you are doing frequent reads, and only for part of the data, then forcing your system to always read the whole record means you are just penalizing yourself. One option is to store the bits that are highly read in a way that doesn't incur the performance penalty of the full read.
Storing objects as JSON means you can change your program and data without having to worry about what the database looks like; it's convenient. But it also has its limitations. If you think you have hit those limitations, then now might be the time to consider a re-architecture.
I thought of splitting the JSON into parts and storing each object in its own separate collection, and when I need to fetch the entire document, I can reassemble the JSON
That's definitely worth looking into. You just need to make sure that the different parts are not stored in the same table / rows, otherwise there'll be no improvement. Think carefully about how you split the objects up: think about the key scenarios the objects deal with (you mention reads, for example). Designing the sub-objects to align with key scenarios is the way to go.
For example, if you commonly show an object's summary in a list of object summaries (e.g. search results), then the summary text, object name, id are candidates for object data that you would split out.
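As a rough sketch of the splitting idea, the functions below break one oversized document into one small document per top-level key and read back a single section. Plain Python lists and dicts stand in for a MongoDB collection; the field names config_id, section, and items and the key format are assumptions for illustration, and the sketch assumes each individual part stays under the 16 MB limit.

```python
def split_document(config_id, big_document):
    """Turn one oversized document into one small document per top-level key."""
    return [
        {
            "_id": f"{config_id}:{section}",  # hypothetical key format
            "config_id": config_id,
            "section": section,
            "items": items,
        }
        for section, items in big_document.items()
    ]

def find_in_section(parts, config_id, section, name):
    """Read path: touch only the one part that holds the requested section."""
    for part in parts:
        if part["config_id"] == config_id and part["section"] == section:
            return [item for item in part["items"] if item.get("name") == name]
    return []

def reassemble(parts, config_id):
    """Rebuild the full document for the rare case where all of it is needed."""
    return {p["section"]: p["items"] for p in parts if p["config_id"] == config_id}
```

With this layout a frequent read such as "the object named objectName in objects_1" loads one part, while a full rewrite replaces all parts that share the same config_id.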

Schemaless Support for Elastic Search Queries

Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.
Consider this example document:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.
What is the best way to support search for this?
We thought a solution would be to just not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked, and there were no performance problems with this approach. However, after running multiple tests for that matter we haven’t been able to get that to work.
Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.
Since this is not currently working for us, we’ve thought of a couple alternative solutions:
Reindexing: this would be costly as we would need to reindex every index that contains that document and do so every time a user updates a property with a different value type. Really bad for performance, so this is likely not a real option.
Use multi-match query: we would do this by appending a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
This means ES would create a new mapping for each ‘random’ field, and we would use phrase multi-match query using a "starts with" wild card for the field names when performing the queries. For example:
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
"query": {
"multi_match": {
"query" : "red",
"type" : "phrase",
"fields" : ["customData_*.favoriteColor"]
}
}
}'
This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?
This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
Any suggestions about any of this would be much appreciated.
Thanks!
You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers field type primitives based upon the first value it sees for that field. Therefore your non-deterministic customData object can get you in trouble if you first see "favoriteColor": 10 followed by "favoriteColor": "red".
For your requirements, you should take a look at SIREn Solutions Elasticsearch plugin which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.
Fields with the same mapping are stored as the same Lucene field in the Lucene index (an Elasticsearch shard). Each distinct Lucene field has its own inverted index (term dictionary and index entries) and its own doc values. Lucene is highly optimized to store documents that share the same fields in a compressed way. Using a mapping with a different field for each document prevents Lucene from applying these optimizations.
You should use Elasticsearch Nested Document to search efficiently. The underlying technology is Lucene BlockJoin, which indexes parent/child documents as a document block.
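One way the nested-document suggestion is sometimes applied to free-form data is to convert customData into a list of key/value entries before indexing, so that the mapping only ever sees a fixed set of field names. The answer above does not spell this out, so treat the following as an assumption-laden sketch: the entry field names key, string_value, number_value, and boolean_value are hypothetical, and the nested mapping itself is not shown.

```python
def to_nested_entries(custom_data, prefix=""):
    """Flatten arbitrary JSON-like data into fixed-shape key/value entries."""
    entries = []
    for key, value in custom_data.items():
        path = f"{prefix}{key}"
        if value is None:
            continue  # nothing to index for nulls in this sketch
        if isinstance(value, dict):
            entries.extend(to_nested_entries(value, prefix=f"{path}."))
        elif isinstance(value, list):
            # One entry per element, all sharing the same key path.
            for item in value:
                entries.extend(to_nested_entries({key: item}, prefix=prefix))
        elif isinstance(value, bool):  # checked before numbers: bool is an int subclass
            entries.append({"key": path, "boolean_value": value})
        elif isinstance(value, (int, float)):
            entries.append({"key": path, "number_value": value})
        else:
            entries.append({"key": path, "string_value": str(value)})
    return entries
```

Because each value type goes to its own field, "favoriteColor": 10 in one resource and "favoriteColor": "red" in another no longer compete for a single inferred field type.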

Store multiple authors in a Couchbase database

I am a newbie to "couchbase server". What I am looking for is to store 10 author names in a Couchbase document, one after another. Can someone please tell me whether the structure should be a single "author" document holding multiple values
{ id : 1, name : Author 1}, { id : 2, name : Author 2}
OR whether to store Author 1 in one document and Author 2 in another.
If so, how can I increment the id automatically before the "insert" command?
You can store all authors in a single document
{ doctype : "Authors",
AuthorNames: [
{
id: 1,
Name : "author1"
},
{
id: 2,
Name : "author2"
}
... and so on
]
}
If you want incrementing IDs, one option is to insert one author at a time, each in a new document, but the ID will then be randomly generated and will not be in incremental order.
In Couchbase, think more about how your application will be using the data than about how you want to store it. For example, will your application need to get all of the 10 authors all of the time? If so, then one document might be worthwhile. Perhaps your application needs to only ever read/write one of the authors at a time. Then you might want to put each in its own document, but have an object key pattern that makes it so you can get the object really fast. Objects that are used often are kept in the managed cache, other objects that are not used often may fall out of the managed cache...and that is ok.
The other factor is what your reads to writes ratio is on this data.
So like I said, it depends on how your application will be reading and writing your data. Use this as the guidance for how your data should be stored.
The single JSON document is pretty straight forward. The more advanced schema design where each author is in its own document and you access them via object key, might be a bit more complicated, but ultimately faster and more scalable depending on what I already pointed out. I will lay out an example schema and some possibilities.
For the authors, I might create each author JSON document with an object key like this:
authors::ID
Where ID is a value I keep in a special incrementer object that I will call authors::incrementer. Think of that object as a key-value pair only holding an integer that happens to be the upper bound of an array. Couchbase SDKs include a special function to increment just such an integer object. With this, my application can put together that object key very quickly. If I want to go after the 5th author, I do a read by object key for "authors::5". If I need to get 10, I do a parallelized BulkGet function and get authors::1 through authors::10. If I want to get all the authors, I get the incrementer object, read that integer, and then do a parallelized bulk get. This way I can get them in order or in whatever order I feel like, and I am accessing them by object key, which is VERY fast in Couchbase.
All this being said, I could use a view to query this data or the upcoming "SQL for Documents" in Couchbase 4.0 or I can mix and match when I query and when I get objects by their key. Key access will ALWAYS be faster. It is the difference between asking a question then going and getting the object and simply knowing the answer and getting it immediately.
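The incrementer-plus-key-pattern scheme described above can be sketched with an in-memory stand-in for the bucket. This is not the Couchbase SDK: the KeyValueStore class and its method names are hypothetical, and only the key names authors::incrementer and authors::ID come from the answer.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value bucket (not the Couchbase SDK)."""

    def __init__(self):
        self._data = {}

    def increment(self, key, delta=1, initial=1):
        """Counter operation: create the counter at `initial`, else add `delta`."""
        if key not in self._data:
            self._data[key] = initial
        else:
            self._data[key] += delta
        return self._data[key]

    def upsert(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def get_multi(self, keys):
        """Stand-in for a bulk get by key."""
        return {key: self._data[key] for key in keys}


def add_author(store, name):
    """Reserve the next ID from the incrementer, then write authors::ID."""
    author_id = store.increment("authors::incrementer")
    key = f"authors::{author_id}"
    store.upsert(key, {"id": author_id, "name": name})
    return key

def all_authors(store):
    """Read the upper bound, build every key, and fetch them in bulk."""
    upper = store.get("authors::incrementer")
    keys = [f"authors::{i}" for i in range(1, upper + 1)]
    found = store.get_multi(keys)
    return [found[key] for key in keys]
```

The point of the pattern is that the application can compute every key it needs from the counter alone, so no query is involved in the read path.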

If JSON represents the 'object', what represents the 'class'?

JSON appears to be a nice way to represent a complex data structure in plain text. If we think of this complex data structure as analogous to an OOP object - an instance of a class - then is there a commonly used JSON-like format that represents the class itself (just the data part - forget methods)? Can JSON itself be used for this?
To put it another way, if JSON encodes name-value pairs, what should I use if I want to encode only the names?
The reason I want this is that I am designing a protocol to use with jQuery (to which I am a complete novice by the way). The client will communicate to the server the structure of the JSON object it wants back, and the server will return a JSON object of that structure with the values added.
The key point is that it is the client that is in full control of what data fields (name-value pairs) the server returns. It's a bit different from all the examples of jQuery that I've found so far on the web where the client makes a request (which usually includes a very limited set of parameters, if any) and the server makes the decision as to what fields to return in the JSON reply.
(Obviously, what the client asks for must be congruent with the server's data model; if the server has an array of widgets each with its own price, the client can't ask for an array of prices each with its own widget.)
This must be a common problem, and I don't want to reinvent the wheel. I want to adopt a solution that is already in common use across the web.
Edit
I just found JSON Schema. This is not what I am looking for. It contains way more than I need.
Edit
I'm looking more for a 'this is how it is usually done' answer, rather than a 'you could try…' answer. (I can invent dozens of possible answers myself.)
To encode only names within JSON, you could use a key/value pair where the key is either the class name or just a key named 'values' - with the value being an array of strings that are the names to be returned by the server. For example:
{ "class_name" : [ "name1", "name2", "name3" ] }
The server can then detect the class name from the key and, if the class supports it, return values for the names in the array, or ignore the request if it does not.
I'm looking more for a 'this is how it is usually done' answer
There is no single "correct" way to do what you want. Many people have their own implementation. It depends on various factors: what you want to do, where you want to do it, and how efficiently you want it done.
For simple structures I would prefer and suggest the answer given by @dbr9979.
For nested structures, you can have nested arrays. Something like:
{
"nestedfield1": {
"nestedfield11":["nestedfield111", "nestedfield112"],
"nestedfield12":["nestedfield121", "nestedfield122"],
"__SIMPLE_FIELDS__": ["simplefield13", "simplefield14"]
}
}
The point is, if the key is __SIMPLE_FIELDS__, the value is an array of simple fields (string, numbers etc..), else the key stands for the key in the object.
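A server-side reading of such a structure might look like the sketch below. It is one possible interpretation, not a standard: the project function is hypothetical, and it assumes that a bare array under an ordinary key is shorthand for "a nested object from which only these simple fields are wanted".

```python
SIMPLE = "__SIMPLE_FIELDS__"

def project(data, spec):
    """Return only the parts of `data` that `spec` names."""
    if isinstance(spec, list):
        # Shorthand: a nested object where only simple fields are requested.
        return {name: data[name] for name in spec if name in data}
    result = {}
    for key, sub_spec in spec.items():
        if key == SIMPLE:
            # Simple fields are copied from the current level of the object.
            result.update({name: data[name] for name in sub_spec if name in data})
        elif key in data:
            # Any other key names a nested object to descend into.
            result[key] = project(data[key], sub_spec)
    return result
```

Fields the client did not name are simply left out of the response, and names the server does not know are skipped rather than treated as errors.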
For something more complex, what I would suggest is you have predefined structures, that both the server and the client know of. This is particularly useful when you have to make multiple identical requests. Assign some unique number for each of them. Something like:
1 => <the structure above>
2 => ["simplefield1", "simplefield2" ..]
3 => etc .. etc
The server stores the above structures and their numbers in a database or similar. The client then sends the id of the required structure, and the server responds accordingly.
I think what you meant by this:
the client that is in full control of what data fields (name-value pairs) the server returns.
is like the difference between SELECT * FROM Bag and SELECT color, price FROM Bag in SQL. Am I interpreting you correctly?
You could query with:
{
"resource": "Bag",
"field_names": ["color", "price"]
}
which will return the response:
{
"status": "success",
"result": [
{"color": "red", "price": 50},
{"color": "blue", "price": 45}
]
}
Most likely, though, you may not actually need your request to be a JSON object; I've seen implementations where the field names are taken from the query string, like http://foo.com/bag?fields=color,price
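The query-string variant can be sketched with the standard library alone. The BAGS data (including the extra weight field), the handle function, and the rule that an absent fields parameter returns every field are all assumptions made for this illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical server-side data; "weight" exists only to show it being filtered out.
BAGS = [
    {"color": "red", "price": 50, "weight": 1.2},
    {"color": "blue", "price": 45, "weight": 0.9},
]

def requested_fields(url):
    """Read the comma-separated `fields` query parameter, if any."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("fields", [""])[0]
    return [name for name in raw.split(",") if name]

def handle(url, rows=BAGS):
    """Return all fields by default, or only the ones the client named."""
    fields = requested_fields(url)
    if not fields:
        result = rows
    else:
        result = [{f: row[f] for f in fields if f in row} for row in rows]
    return {"status": "success", "result": result}
```

For http://foo.com/bag?fields=color,price this yields the same response body as the JSON request above.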
I was looking for Partial Response.
RESTful API Design: can your API give developers just the information they need? explains it all and gives examples from LinkedIn, Facebook, and Google. Google and Facebook both have similar approaches. Here's how Lie Ryan's example would look using Google's approach:
url?fields=status,result(color,price)
Since Google and Facebook are behind this, I would not be surprised to see this become a de facto standard.
In my case I am likely to run into a length limitation on the URL and so have to use POST instead, but this is an excellent starting point for me.
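For anyone wanting to experiment with the parenthesized form, here is a small sketch that parses an expression like status,result(color,price) and applies it to a response. It only mimics the syntax shown above; it is not Google's or Facebook's implementation, it ignores any other features their real grammars may have, and the function names are hypothetical.

```python
def parse_fields(expr):
    """Parse 'a,b(c,d)' into {'a': None, 'b': {'c': None, 'd': None}}."""
    def parse_list(pos):
        tree = {}
        while pos < len(expr):
            start = pos
            while pos < len(expr) and expr[pos] not in ",()":
                pos += 1
            name = expr[start:pos].strip()
            sub = None
            if pos < len(expr) and expr[pos] == "(":
                sub, pos = parse_list(pos + 1)
                if pos >= len(expr) or expr[pos] != ")":
                    raise ValueError("unbalanced parentheses")
                pos += 1  # skip the closing ')'
            if name:
                tree[name] = sub
            if pos < len(expr) and expr[pos] == ",":
                pos += 1
            elif pos < len(expr) and expr[pos] == ")":
                break  # let the caller consume the ')'
        return tree, pos

    tree, pos = parse_list(0)
    if pos != len(expr):
        raise ValueError("unbalanced parentheses")
    return tree

def apply_fields(data, tree):
    """Keep only the named fields; lists are filtered element by element."""
    if isinstance(data, list):
        return [apply_fields(item, tree) for item in data]
    out = {}
    for name, sub in tree.items():
        if name in data:
            out[name] = data[name] if sub is None else apply_fields(data[name], sub)
    return out
```

The same expression could be sent in a POST body instead of the URL when the length limit becomes a problem, since the parser only needs the string.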