Auto update of child document when parent document updates - couchbase

I am currently studying Couchbase and I have two related documents, like the following below:
Parent Document:
{
"userId":"testUser",
"password":"password123",
"status":"LOCKED"
}
Child Document:
{
"dateUpdated":"2014/12/21",
"remarks":"Sample remarks",
"user":{
"userId":"testUser",
"password":"password123",
"status":"LOCKED"
}
}
Is it possible for Couchbase to auto-update the child document if there are changes on the parent document, e.g. if someone changes the user's userId or the status of the user?

There is no mechanism in Couchbase to do what you describe. The only way to do this is if you denormalize the data into a single document, so I would consider whether you really need these two entity types as separate documents.
If you absolutely must have this sort of transactional logic, and cannot denormalize the data into a single document, you can look into implementing a two-phase commit. I would recommend against it in most cases, because of the additional complexity and performance cost, but if you must, you must. http://docs.couchbase.com/developer/dev-guide-3.0/transactional-logic.html
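If you keep the two documents, the propagation has to happen in your application code. A minimal sketch of that pattern, using plain Python dicts as a stand-in for Couchbase get/upsert calls (all keys and names here are hypothetical):

```python
# A dict stands in for the bucket; in real code these reads/writes would be
# Couchbase key-value get() / upsert() operations.
store = {
    "user::testUser": {"userId": "testUser", "password": "password123",
                       "status": "LOCKED"},
    "audit::1": {"dateUpdated": "2014/12/21", "remarks": "Sample remarks",
                 "user": {"userId": "testUser", "password": "password123",
                          "status": "LOCKED"}},
}

def update_user(store, user_key, changes, child_keys):
    """Update the parent document, then re-embed a copy in each child."""
    parent = store[user_key]
    parent.update(changes)              # mutate the parent document
    store[user_key] = parent
    for key in child_keys:              # propagate the embedded copy
        child = store[key]
        child["user"] = dict(parent)
        store[key] = child

update_user(store, "user::testUser", {"status": "ACTIVE"}, ["audit::1"])
```

Note that without a transaction, a crash between the two writes leaves the documents inconsistent, which is exactly why the two-phase-commit pattern linked above exists.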

Related

Is it better to store nested data or use flat structure with unique names in JSON?

In simple words: Is
{
"diary":{
"number":100,
"year":2006
},
"case":{
"number":12345,
"year":2006
}
}
or
{
"diary_number":100,
"diary_year":2006,
"case_number":12345,
"case_year":2006
}
better when using Elasticsearch?
In my case total keys are only a few (10-15). Which is better performance wise?
Use case is displaying data from noSQL database (mostly dynamoDB). Also feeding it into Elasticsearch.
My rule of thumb: if you need to query/update nested fields, use a flat structure.
If you use a nested structure, Elasticsearch will flatten it anyway, but then has the overhead of managing those relations. Performance-wise, flat is always better, since Elasticsearch doesn't need to relate and find nested documents.
Here's an excerpt from Managing Relations Inside Elasticsearch which lists some disadvantages you might want to consider.
Elasticsearch is still fundamentally flat, but it manages the nested
relation internally to give the appearance of nested hierarchy. When
you create a nested document, Elasticsearch actually indexes two
separate documents (root object and nested object), then relates the
two internally. Both docs are stored in the same Lucene block on the
same Shard, so read performance is still very fast.
This arrangement does come with some disadvantages. Most obvious, you
can only access these nested documents using a special nested
query. Another big disadvantage comes when you need to update the
document, either the root or any of the objects.
Since the docs are all stored in the same Lucene block, and Lucene
never allows random write access to its segments, updating one field
in the nested doc will force a reindex of the entire document.
This includes the root and any other nested objects, even if they were
not modified. Internally, ES will mark the old document as deleted,
update the field and then reindex everything into a new Lucene block.
If your data changes often, nested documents can have a non-negligible
overhead associated with reindexing.
Lastly, it is not possible to "cross reference" between nested
documents. One nested doc cannot "see" another nested doc's
properties. For example, you are not able to filter on "A.name" but
facet on "B.age". You can get around this by using include_in_root,
which effectively copies the nested docs into the root, but this gets
you back to the problems of inner objects.
Nested data is quite good. Unless you explicitly declare diary and case as nested fields, they will be indexed as object fields, so Elasticsearch will itself convert them to
{
"diary.number":100,
"diary.year":2006,
"case.number":12345,
"case.year":2006
}
Consider also that every field value in Elasticsearch can be an array. You need the nested datatype only if you have many diaries in a single document and need to "maintain the independence of each object in the array".
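The object-field flattening described above can be illustrated with a short sketch (this mimics the dotted-key result, not Elasticsearch's actual internals):

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted keys, roughly the shape
    Elasticsearch produces for plain 'object' fields."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))   # recurse into sub-objects
        else:
            flat[path] = value
    return flat

doc = {"diary": {"number": 100, "year": 2006},
       "case": {"number": 12345, "year": 2006}}
print(flatten(doc))
# {'diary.number': 100, 'diary.year': 2006, 'case.number': 12345, 'case.year': 2006}
```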
The answer is a clear it-depends. JSON is famous for its nested structures. However, some tools can only deal with key-value structures and flat JSONs, and I feel Elastic is more fun with flat JSONs, in particular if you use Logstash; see e.g. https://discuss.elastic.co/t/what-is-the-best-way-of-getting-mongodb-data-into-elasticsearch/40840/5
I am happy to be proven wrong.

How do I support “Tables” in Couchbase?

RDBMSes have tables; also, similar concepts exist in NoSQL, like Kinds in Google Datastore. But Couchbase puts everything into one big namespace. How do I arrange my data in a table-like way?
I want the performance advantages of table-like namespacing. If I have 1,000,000 rows of one type, and 10 rows of another, I'd rather that the query engine not have to look through 1,000,010 rows to find one of those ten.
Buckets are available, but only up to ten. So, these are not really table-like.
Tables could be implemented on the application layer with a type or kind property in each JsonDocument. But this mixes different abstraction layers: metadata with data.
You can prefix each key with a "Table"-like name. "User:111" instead of 111.
How can I achieve the benefits of Tables/Kinds in Couchbase?
Currently, the correct way to do this is to add an attribute that represents the type of the document, and then create indexes with your "type" attribute in them. Your query will then scan the index directly instead of doing a full scan. This might sound uncommon at first, but indexes are one of the most powerful features in Couchbase.
You can see if your query is using the index you have created in the "Plan" tab of the web console:
https://blog.couchbase.com/couchbase-5-5-enhanced-query-plan-visualization/
If you are using Spring Data, this is done for you automatically through the attribute "_class": https://blog.couchbase.com/couchbase-spring-boot-spring-data/
Creating multiple buckets for this use case isn't a good strategy, as you will need some extra work whenever you need to make a join.
There are some metadata about the document which you can access via meta() in your query (ex: meta().id, meta().cas) but the type itself has to stay as a top-level attribute of the document.
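The type-attribute-plus-index pattern looks roughly like this in N1QL (bucket name `mybucket` and the attribute values are hypothetical; adjust to your own schema):

```sql
-- Partial index covering only documents of one type
CREATE INDEX idx_user_type ON `mybucket`(type) WHERE type = "user";

-- This query can use the index instead of scanning the whole bucket
SELECT u.* FROM `mybucket` u WHERE u.type = "user" AND u.status = "LOCKED";
```

The WHERE clause on the index makes it a partial index, so it only contains entries for the matching type, which is about as close as Couchbase gets to a per-table index.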
You can prefix each key with a "Table"-like name, e.g. "User:111" instead of 111. This is useful when you need to filter which documents should be replicated via Cross Data Center Replication: https://blog.couchbase.com/deep-dive-cross-data-center-replication-xdcr/

Couchbase: What benefits do I get from using the document ID?

I'm new to the NoSQL world as I've been programming RDBMS for a while now. In an RDBMS, you have the notion of a PRIMARY KEY per table. You reference other tables using FOREIGN KEYs and usually, if denormalized well, you have another table that just basically contains mapping from TABLE A and TABLE B so you can join them.
In Couchbase, there's this concept of a Document ID, where a document has its own unique key external to the document itself. What is this Document ID good for? The only use I see for it is querying for the object itself (using the USE KEYS clause).
I could just specify an "id" and "type" in my JSON document and just assign random UUIDs for the document key.
What benefits do I get from using it? ELI5 if possible.
And also, why do some developers add "prefixes" to the document ID (e.g. "customer::customername")?
That is an excellent question, and the answer is both historical and technical.
Historical: Couchbase originated from CouchOne/CouchDB and Membase, the latter being a persistent distributed version of the memcached key-value store. Couchbase still operates as a key-value store, and the fastest way to retrieve a document is via a key lookup. You could retrieve a document using an index based on one of the document fields, but that would be slower.
Technically, the ability to retrieve documents extremely quickly given their ID is one advantage that makes Couchbase attractive for many users/applications (along with scalability and reliability).
Why do some developers add "prefixes" to document IDs, such as "customer::{customer name}"? For reasons related to fast retrieval and data modeling. Let's say you have a small document containing a customer's basic profile, and you use the customer's email address as the document ID. The customer logs in, and your application can retrieve this profile with a very fast k-v lookup using the email as the ID. You want to keep this document small so it can be retrieved more quickly.
Maybe the customer sometimes wants to view their entire purchase history. Your application might want to keep that purchase history in a separate document, because it's too big to retrieve unless you really need it. So you would store it with the document id {email}::purchase_history, so you can again use a k-v lookup to retrieve it. Also, you don't need to store the key for the purchase history record anywhere - it is implied. Similarly, the customer's mailing addresses might be stored under document ID {email}::addresses. Etc.
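The key convention described above is just string composition; a small sketch (the `::` separator and the helper name are conventions of this example, not anything Couchbase enforces):

```python
def doc_key(kind, ident, *subkeys):
    """Build a namespaced Couchbase document ID,
    e.g. 'customer::alice@example.com::purchase_history'."""
    return "::".join([kind, ident, *subkeys])

profile_key = doc_key("customer", "alice@example.com")
history_key = doc_key("customer", "alice@example.com", "purchase_history")
print(profile_key)   # customer::alice@example.com
print(history_key)   # customer::alice@example.com::purchase_history
```

Because the related keys are derivable from the email alone, the application never has to store or look up the purchase-history key; it can go straight to a k-v get.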
Data modeling in Couchbase is just as important as in a traditional RDBMS, but you go about it differently. There's a nice discussion of this in the free online training: https://training.couchbase.com/online?utm_source=sem&utm_medium=textad&utm_campaign=adwords&utm_term=couchbase%20online%20training&utm_content=&gclid=CMrM66Sgw9MCFYGHfgodSncCGA#
Why does Couchbase still use an external key instead of a primary key field inside the JSON? Because Couchbase still permits non-JSON data (e.g., binary data). In addition, while a relational database could permit multiple fields or combination of fields to be candidate keys, Couchbase uses the document ID for its version of sharding, so the document ID can't be treated like other fields.

Duplicating columns from parent to child model. Good or bad practice?

I have a parent model Post and a child model Comment. Posts have privacy setting - column privacy in the DB. Any time when I have to deal with a child model Comment I have to check privacy settings if the parent model: $comment->post->privacy.
My app is becoming bigger and bigger, and this approach needs more and more SQL requests. Eager loading helps, but sometimes there is no other reason to touch the parent model except checking the privacy field.
My question is: Is it a good practice to duplicate the privacy column into the Comments table and keep them in sync? It would allow me to simply use $comment->privacy without touching the Posts table.
Planned redundancy (denormalization of the model) for a specific purpose can be good.
You specifically mention keeping the privacy column on the child table "in sync" with the privacy column in the parent table. That implies you have control of the redundancy. That's acceptable practice, especially for improved performance.
If it doesn't improve performance, then there wouldn't really be a need.
Uncontrolled redundancy can be bad.
Assuming that the privacy property has to be in the parent (if a Post is never used directly on its own, you could always move the privacy property to all the children):
First, try to enhance performance using optimization techniques (indexes, materialized views, etc.).
Second, if that didn't help much with performance (a very rare case), you can start thinking about duplicating the information. But that should be your last option, and you need to take all possible measures to preserve data consistency (using constraints, triggers, or whatever).
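If you do duplicate the column, a trigger is one way to keep it in sync. A sketch in MySQL-style SQL, with hypothetical table and column names (`posts.id`, `comments.post_id`):

```sql
-- Add the duplicated column to the child table
ALTER TABLE comments ADD COLUMN privacy VARCHAR(20);

-- Whenever a post's privacy changes, push it down to its comments
CREATE TRIGGER sync_comment_privacy
AFTER UPDATE ON posts
FOR EACH ROW
  UPDATE comments
  SET privacy = NEW.privacy
  WHERE post_id = NEW.id;
```

New comments would also need to copy the value on insert (either in application code or a second trigger on `comments`).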
Duplicating columns will be bad in terms of space. Assume a situation where you have a huge amount of data in the Posts model: if you duplicate it, the same amount of space is used again, just to save time.
Basically, you always have to think about the trade-off between space and time optimization.
Try to optimize time with an algorithmic approach (hash tables, indexing, binary search trees, and so on). If it is still too time-consuming past a certain amount of data, then think about duplicating data. But remember: performance may increase, but more space will be used for the same data.

Database design for recursive children

This design problem is turning out to be a bit more "interesting" than I'd expected....
For context, I'll be implementing whatever solution I derive in Access 2007 (not much choice--customer requirement. I might be able to talk them into a different back end, but the front end has to be Access (and therefore VBA & Access SQL)). The two major activities that I anticipate around these tables are batch importing new structures from flat files and reporting on the structures (with full recursion of the entire structure). Virtually no deletes or updates (aside from entire trees getting marked as inactive when a new version is created).
I'm dealing with two main tables, and wondering if I really have a handle on how to relate them: Products and Parts (there are some others, but they're quite straightforward by comparison).
Products are made up of Parts. A Part can be used in more than one Product, and most Products employ more than one Part. I think that a normal many-to-many resolution table can satisfy this requirement (mostly--I'll revisit this in a minute). I'll call this Product-Part.
The "fun" part is that many Parts are also made up of Parts. Once again, a given Part may be used in more than one parent Part (even within a single Product). Not only that, I think that I have to treat the number of recursion levels as effectively arbitrary.
I can capture the relations with a m-to-m resolution from Parts back to Parts, relating each non-root Part to its immediate parent part, but I have the sneaking suspicion that I may be setting myself up for grief if I stop there. I'll call this Part-Part. Several questions occur to me:
Am I borrowing trouble by wondering about this? In other words, should I just implement the two resolution tables as outlined above, and stop worrying?
Should I also create Part-Part rows for all the ancestors of each non-root Part, with an extra column in the table to store the number of generations?
Should Product-Part contain rows for every Part in the Product, or just the root Parts? If it's all Parts, would a generation indicator be useful?
I have (just today, from the Related Questions), taken a look at the Nested Set design approach. It looks like it could simplify some of the requirements (particularly on the reporting side), but thinking about generating the tree during the import of hundreds (occasionally thousands) of Parts in a Product import is giving me nightmares before I even get to sleep. Am I better off biting that bullet and going forward this way?
In addition to the specific questions above, I'd appreciate any other comentary on the structural design, as well as hints on how to process this, either inbound or outbound (though I'm afraid I can't entertain suggestions of changing the language/DBMS environment).
Bills of materials and exploded parts lists are always so much fun. I would implement Parts as your main table, with a Boolean field to say a part is "sellable". This removes the first-level recursion difference and the redundancy of Parts that are themselves Products. Then, implement Products as a view of Parts that are sellable.
You're on the right track with the PartPart cross-ref table. Implement a constraint on that table that says the parent Part and the child Part cannot be the same Part ID, to save yourself some headaches with infinite recursion.
Generational differences between BOMs can be maintained by creating a new Part at the level of the actual change, and in any higher levels in which the change must be accommodated (if you want to say that this new Part, as part of its parent hierarchy, results in a new Product). Then update the reference tree of any Part levels that weren't revised in this generational change (to maintain Parts and Products that should not change generationally if a child does). To avoid orphans (unreferenced Parts records that are unreachable from the top level), Parts can reference their predecessor directly, creating a linked list of ancestors.
This is a very complex web, to be sure; persisting tree-like structures of similarly-represented objects usually are. But, if you're smart about implementing constraints to enforce referential integrity and avoid infinite recursion, I think it'll be manageable.
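The "exploded parts list" report over the Part-Part cross-ref table is a straightforward recursive walk. A sketch with hypothetical data (a dict maps each parent part to its direct (child, quantity) rows, the same shape as the cross-ref table):

```python
def explode(bom, part, qty=1, depth=0):
    """Recursively expand a part into (depth, part, total_qty) lines,
    multiplying quantities down the tree."""
    lines = [(depth, part, qty)]
    for child, child_qty in bom.get(part, []):
        lines.extend(explode(bom, child, qty * child_qty, depth + 1))
    return lines

# Hypothetical BOM: a bicycle has 2 wheels, each wheel has 32 spokes.
bom = {
    "bicycle": [("wheel", 2), ("frame", 1)],
    "wheel": [("spoke", 32), ("rim", 1)],
}
for depth, part, qty in explode(bom, "bicycle"):
    print("  " * depth + f"{part} x{qty}")
```

In Access/VBA the same walk would be a recursive procedure over the Part-Part table; the no-self-reference constraint mentioned above is what guarantees the recursion terminates (for cycles longer than one, you would also track visited parts).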
I would have one part table for atomic parts, then a superpart table with a superpartID and its related subparts. Then you can have a product/superpart table.
If a part is also a superpart, then you just have one row for the superpartID with the same partID.
Maybe 'component' is a better term than superpart. Components could be reused in larger components, for example.
You can find sample Bill of Materials database schemas at
http://www.databaseanswers.org/data_models/
The website offers Access applications for some of the models. Check with the author of the website.