Couchbase: What benefits do I get from using the document ID? - couchbase

I'm new to the NoSQL world as I've been programming against RDBMSs for a while now. In an RDBMS, you have the notion of a PRIMARY KEY per table. You reference other tables using FOREIGN KEYs and usually, if normalized well, you have another table that basically just contains the mapping between TABLE A and TABLE B so you can join them.
In Couchbase, there's this concept of a document ID, where a document has its own unique key external to the document itself. What is this document ID good for? The only use I see for it is querying for the object itself (using the USE KEYS clause).
I could just specify an "id" and "type" in my JSON document and just assign random UUIDs for the document key.
What benefits do I get from using it? ELI5 if possible.
And also, why do some developers add "prefixes" to the document ID (e.g. "customer::customername")?

That is an excellent question, and the answer is both historical and technical.
Historical: Couchbase originated from CouchOne/CouchDB and Membase, the latter being a persistent distributed version of the memcached key-value store. Couchbase still operates as a key-value store, and the fastest way to retrieve a document is via a key lookup. You could retrieve a document using an index based on one of the document fields, but that would be slower.
Technically, the ability to retrieve documents extremely quickly given their ID is one advantage that makes Couchbase attractive for many users/applications (along with scalability and reliability).
Why do some developers add "prefixes" to document IDs, such as "customer::{customer name}"? For reasons related to fast retrieval and data modeling. Let's say you have a small document containing a customer's basic profile, and you use the customer's email address as the document ID. The customer logs in, and your application can retrieve this profile with a very fast k-v lookup using the e-mail as the ID. You want to keep this document small so it can be retrieved more quickly.
Maybe the customer sometimes wants to view their entire purchase history. Your application might want to keep that purchase history in a separate document, because it's too big to retrieve unless you really need it. So you would store it with the document id {email}::purchase_history, so you can again use a k-v lookup to retrieve it. Also, you don't need to store the key for the purchase history record anywhere - it is implied. Similarly, the customer's mailing addresses might be stored under document ID {email}::addresses. Etc.
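To make that key pattern concrete, here is a minimal sketch using the Couchbase Node.js SDK (the bucket name, credentials, and connection string are placeholders assumed for illustration, not anything from the question):

const couchbase = require('couchbase');

async function loadCustomer(email) {
  // Connection details are placeholders for illustration only.
  const cluster = await couchbase.connect('couchbase://localhost', {
    username: 'app_user',
    password: 'app_password',
  });
  const collection = cluster.bucket('app-data').defaultCollection();

  // Fast key-value lookup of the small profile document, keyed by e-mail.
  const profile = await collection.get(email);

  // The keys of the related documents are implied by the pattern,
  // so no index or query is needed to find them.
  const history = await collection.get(`${email}::purchase_history`);
  const addresses = await collection.get(`${email}::addresses`);

  return {
    profile: profile.content,
    purchaseHistory: history.content,
    addresses: addresses.content,
  };
}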
Data modeling in Couchbase is just as important as in a traditional RDBMS, but you go about it differently. There's a nice discussion of this in the free online training: https://training.couchbase.com/online?utm_source=sem&utm_medium=textad&utm_campaign=adwords&utm_term=couchbase%20online%20training&utm_content=&gclid=CMrM66Sgw9MCFYGHfgodSncCGA#
Why does Couchbase still use an external key instead of a primary key field inside the JSON? Because Couchbase still permits non-JSON data (e.g., binary data). In addition, while a relational database could permit multiple fields or combination of fields to be candidate keys, Couchbase uses the document ID for its version of sharding, so the document ID can't be treated like other fields.

Related

How does Salesforce query using the relationships table behind the scenes?

I'm trying to figure out how Salesforce's metadata architecture works behind the scenes. There's a video they've released ( https://www.youtube.com/watch?v=jrKA3cJmoms ) where the presenter goes through many of the important tables that drive it (about 18 minutes in).
I've figured out the structure for the basic representation / storage / retrieval of simple stuff, but where I'm hazy is how the relationship pivot table works. I'll be happy when:
a) I know exactly how the pivot table relates to things (RelationId column he mentions is not clear to me)
b) I can construct a query for it.
Screenshot from the video
I've not had any luck finding any resources describing it at this level in the detail I need, or managed to find any packages that emulate it that I can learn from.
Does anyone have any low-level experience with this part of Salesforce that could help?
EDIT: Thank you, David Reed for further details in your edit. So presumably you agree that things aren't exactly as explained?
In the 'value' column, the GUID of the related record is stored
This allows easy fetching of to-one related records and, with a little bit of simple SQL switching, resolving a group of records in the reverse direction.
I believe Salesforce doesn't have native many-to-many relationships, as opposed to using a 'junction' object, so the above is still relevant.
I guess now though I wonder what the point of the pivot table is at all, as there's a very simple relationship going on here now. Unless the lack of an index on the value column dictates the need for one...
Or, could it be more likely/useful if:
The record's value column stores a GUID to the relationship record and not directly to the related record?
This relationship record holds all necessary information required to put together a decent query and ALSO includes the GUID of the related record?
Neither option clears up the ambiguity for me, unless I'm missing something.
You cannot see, query, or otherwise access the internal tables that underlie Salesforce's on-platform schema. When you build an application on the platform, you query relationships using SOQL relationship queries; there are no pivot tables involved in the work you can see and do on the platform.
While some presentations and documentation discuss at some level the underlying implementation, the precise details of the SQL tables, schemas, query optimizers, and so on are not public.
As a Salesforce developer, or a developer who interacts with Salesforce via the API, you almost never need to worry about the underlying SQL implementation used on Salesforce's servers. The main point at which that knowledge can become helpful is when you are working with massive data volumes (multiple millions of records). The most helpful documentation for that use case is Best Practices for Deployments with Large Data Volumes. The underlying schema is briefly discussed under Underlying Concepts. But bear in mind:
As a customer, you also cannot optimize the SQL underlying many application operations because it is generated by the system, not written by each tenant.
The implementation details are also subject to change.
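For completeness, this is roughly what relationship access looks like at the level you can actually work with. A hedged sketch using the jsforce Node.js library (my choice for illustration; any API client works) to run a SOQL parent-to-child relationship query - no internal pivot tables are visible at this level:

const jsforce = require('jsforce');

async function listAccountContacts() {
  const conn = new jsforce.Connection({ loginUrl: 'https://login.salesforce.com' });
  // Credentials are placeholders.
  await conn.login('user@example.com', 'passwordPlusSecurityToken');

  // A parent-to-child SOQL relationship query; the platform resolves the
  // relationship for you, however it is implemented internally.
  const result = await conn.query(
    'SELECT Name, (SELECT LastName, Email FROM Contacts) FROM Account LIMIT 10'
  );
  for (const account of result.records) {
    console.log(account.Name, account.Contacts ? account.Contacts.records : []);
  }
}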
Metadata Tables and Data Tables
When an organisation declares an object’s field with a relationship type, Force.com maps the field to a Value field in MT_Data, and then uses this field to store the ObjID of a related object.
I believe the documentation you mentioned is using the identifier ObjId ambiguously, and here actually means what it refers to earlier in the document as GUID - the Salesforce Id. Another paragraph states
The MT_Name_Denorm table is a lean data table that stores the ObjID and Name of each record in MT_Data. When an application needs to provide a list of records involved in a parent/child relationship, Force.com uses the MT_Name_Denorm table to execute a relatively simple query that retrieves the Name of each referenced record for display in the app, say, as part of a hyperlink.
This also doesn't make sense unless ObjId is being used to mean what is called GUID in the visual depiction of the table above in the document - the Salesforce Id of the record.

Store "extended" metadata on entities stored in Azure Cosmos DB as JSON documents

We are building a REST API in .NET deployed to Azure App Service / Azure API App. From this API, clients can create "Products" and query "Products". The product entity has a set of common fields that all clients have to provide when creating a product, like the fields below (example):
{
  "id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
  "name": "Product One",
  "description": "The number one product"
}
We store these products currently as self-contained documents in Azure Cosmos DB.
Question 1: Partitioning.
The collection will not store a huge number of documents; we are talking about a maximum of around 2 500 000 documents of 1-5 KB each (estimates). We currently have chosen the id field (which is our system-generated id, not the internal Cosmos DB document id) as partition key, which means 2 500 000 logical partitions with one document each. The documents will be used in some low-latency workloads, but those workloads will query by id (the partition key). Clients will also query by e.g. name, and then we have a fan-out query, but those queries will not be latency-critical. In the portal, you can't create a single-partition collection anymore, but you can do it from the SDK or use a fixed partition key value. If we have all these documents in one single partition (we are talking about data far below 10 GB here), we will never get any fan-out queries, but rely more on the index within the one logical partition. So the question: even if we don't have huge amounts of data, is it still wise to partition like we currently have done?
Question 2: Extended metadata.
We will face clients that want to write client/application/customer-specific metadata beyond the basic common fields. What is the best way to do this?
Some brainstorming from me below.
1: Just dump everything in one self-contained document.
One option is to allow clients of the API to add a kind of nested "extendedMetadata" field with key-value pairs when creating a product. Cosmos DB is schema-agnostic, so in theory this should work fine. Some products can have zero extended metadata, while other products can have a lot of it. For the clients, we can promise the basic common fields, but for the extended metadata field we cannot promise anything in terms of number of fields, naming etc. The document size will then vary. These products will, as mentioned, still be used in latency-critical workloads that will query by "id" (the partition key). The extended metadata will never be used in any latency-critical workloads. How much, and in what way, does document size affect performance / throughput in general? For the latency-critical read scenario, will the query optimizer go straight to the right partition and then use the index to quickly retrieve the document fields of interest? Or will the whole document always be loaded and processed, independent of which fields you want to query?
{
  "id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
  "name": "Product One",
  "description": "The number one product",
  "extendedMetadata": {
    "prop1": "prop1",
    "prop2": "prop2",
    "propN": "propN"
  }
}
The extended metadata is only useful to retrieve from the same API in certain situations. We can then do something like:
api.org.com/products/{id} -- will always return a product with the basic common fields
api.org.com/products/{id}/extended -- will return the full document (basic + extended metadata)
2: Split the document
One option might be to do some kind of splitting. If a client of the API creates a product that contains extended metadata, we can implement some logic that splits the document if extendedMetadata contains data. I guess the split can be done in many ways; brainstorming below. I guess the main objective of splitting the documents (which requires more work on write operations) is to get better throughput in case the document size plays a significant role here (in most cases, the clients will be OK with the basic common fields).
One basic document that only contains the basic common fields, and one extended document that (with the same id) contains the basic common fields + extended metadata (duplicating the basic common fields). We can add a "type" field that differentiates between the basic and extended document. If a client asks for extended, we will only query documents of type "extended".
One basic document that only contains the basic common fields + a reference to an extended document that only contains the extended metadata. This means a read operation where the client asks for a product with extended metadata requires reading two documents.
Look into splitting it across different collections: one collection holds the basic documents, with throughput dedicated to low-latency read scenarios, and one collection holds the extended metadata.
Sorry for the long post. Hope this was understandable; looking forward to your feedback!
Answer 1:
If you can guarantee that the total size of the documents will never be more than 10 GB, then creating a fixed collection is the way to go, for two reasons.
First, there is no need for a cross-partition query. I'm not saying it will be lightning fast without partitioning, but because you are only interacting with a single physical partition, it will be faster than going into every single physical partition looking for data.
(Keep in mind, however, that every time people think they can guarantee things like the max size of something, it usually doesn't work out.)
The /id partitioning strategy is only efficient if you can ALWAYS provide the id. This is called a read. If you need to search by any other property, you are performing a query. This is where the system wouldn't do so well.
Ideally you should design your Cosmos DB collection in a way that you never do a cross-partition query as part of your everyday workload. Maybe once in a blue moon for reporting reasons.
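To illustrate the read vs. query distinction, here is a minimal sketch with the @azure/cosmos JavaScript SDK (the database/container names are assumptions; the partition key /id follows the question):

const { CosmosClient } = require('@azure/cosmos');

async function demo(endpoint, key, productId) {
  const client = new CosmosClient({ endpoint, key });
  const container = client.database('catalog').container('products');

  // Point read: id and partition key value are both known (here they are the
  // same value because /id is the partition key). This is the cheap path.
  const { resource: product } = await container.item(productId, productId).read();

  // Query by a non-key property: a fan-out across partitions, acceptable for
  // the non-latency-critical paths described in the question.
  const { resources: matches } = await container.items
    .query({
      query: 'SELECT * FROM c WHERE c.name = @name',
      parameters: [{ name: '@name', value: 'Product One' }],
    })
    .fetchAll();

  return { product, matches };
}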
Answer 2:
Cosmos DB is a NoSQL schema-less database for a reason.
The second approach in your brainstorming would be fitting for a traditional RDBMS, but we don't have that here.
You can simply go with your first approach and either have everything under a single property or just have them at the top level.
Remember that you can simply map the response to any object that you want, so you can have two DTOs, a slim and an extended version, and map to the appropriate one depending on the endpoint.
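A tiny sketch of that mapping idea in plain JavaScript (the function names are invented for illustration):

// Slim DTO: only the basic common fields promised to every client.
function toSlimDto(doc) {
  return { id: doc.id, name: doc.name, description: doc.description };
}

// Extended DTO: basic fields plus whatever extended metadata exists.
function toExtendedDto(doc) {
  return { ...toSlimDto(doc), extendedMetadata: doc.extendedMetadata || {} };
}

// GET /products/{id}          -> respond with toSlimDto(doc)
// GET /products/{id}/extended -> respond with toExtendedDto(doc)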
Hope this helps.

Using Couchbase with thousands of different schemas

Consider a multi-tenant application in which tenants are free to model their own schemas. I.e.: backend-as-a-service.
With these requirements, a bucket per 'table' is not doable. Instead, I'm thinking of simply having an attribute 'schema-id' identify the schema. Each 'schema-id' is a compound key based on tenantId + schemaId.
As far as retrieval goes, only 'get by id' needs to be supported. In that sense I'm only using Couchbase as a k/v store instead of a document store.
Any caveats to the above? Would the sheer number of entities per bucket be a problem? Any other things to think about?
The key pattern idea sounds great to me. You will have to make sure your cluster is sized correctly and stays sized correctly over time.
If you wanted to really control everything tightly, you could even front the whole thing with a simple REST API. Then you could control access tightly, control that key pattern, etc. Each user of the service would get an API key that would give them a session.
Going with different buckets for different schemas will not scale, because I believe there is a restriction of only 10 buckets in Couchbase.
Since the key is known by the client, we can map the data from Couchbase to a particular class, because the key tells us what type of schema the document has.
For example, if the key is PRODUCT_1234 or USER_12345, we know the data for the first key is of type PRODUCT and for the second it is of type USER.
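A hedged sketch of that key pattern with the Couchbase Node.js SDK (the bucket name, credentials, and the tenant::schema::id layout are assumptions based on the discussion above):

const couchbase = require('couchbase');

// Compose a predictable key from tenant, schema and entity id.
function makeKey(tenantId, schemaId, entityId) {
  return `${tenantId}::${schemaId}::${entityId}`;
}

async function getEntity(collection, tenantId, schemaId, entityId) {
  const key = makeKey(tenantId, schemaId, entityId);
  const result = await collection.get(key); // pure k-v lookup, no query
  // The schema is known from the key, so the caller can decide how to
  // deserialize/validate the returned JSON.
  return { schemaId, data: result.content };
}

async function main() {
  const cluster = await couchbase.connect('couchbase://localhost', {
    username: 'app_user',
    password: 'app_password',
  });
  const collection = cluster.bucket('tenant-data').defaultCollection();
  console.log(await getEntity(collection, 'tenantA', 'product', '1234'));
}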

Translate sql database schema to IndexedDB

I have three tables in my SQL schema: clients (with address and so on), orders (with order details), and files (which stores uploaded files). Both the files table and the orders table contain foreign keys referencing the clients table.
How would I do that in IndexedDB? I'm new to this whole key-index thinking and would just like to understand how the same thing would be done with IndexedDB.
Now I know there is a shim.js file, but I'm trying to understand the concept itself.
Help and tips highly appreciated!
EDIT:
So I would really have to think about which queries I want to allow and then optimize my IndexedDB implementation for those queries, is that the main point here? Basically, I want to store a customer once and then many orders for that customer, and then be able to upload small files (preferably PDFs) for that customer, not even necessarily for each order (although if that's easy to implement, I may do it)... I see every customer as a separate entity; I won't have things like "give me all customers who ordered xy" - I only need to have each customer once and then store all the orders for the customer and all the files. I want to be able to go: search for a customer with the name XY - which then gives me a list of all orders and their dates and a list of the files uploaded for that customer (maybe associated to the order).
This question is a bit too broad to answer correctly. Nevertheless, the major concept to learn when transitioning from SQL to No-SQL (indexedDB) is the concept of object stores. Most SQL databases are relational and perform much of the work of optimizing queries for you. indexedDB does not. So the concepts of normalization and denormalization work a bit differently. The focal point is to explicitly plan your own queries. Unlike the design of an app/system that allows simple ad-hoc SQL queries that are designed at a later point in time, and possibly even easily added/changed at a later time, you really need to do a lot of the planning up front for indexedDB.
So it is not quite safe to say that the transition is simply a matter of creating three object stores to correspond to your three relational tables. For one, there is no concept of joining in indexedDB so you cannot join on foreign keys.
It is not clear from your question but your 3 tables are clients, orders, and files. I will go out on a limb here and make some guesses. I would bet you could use a single object store, clients. Then, for each client object, store the normal client properties, store an orders array property, and store a files array property. In the orders array, store order objects.
If your files are binary, this won't work, you will need to use blobs, and may even encounter issues with blob support in various browser indexedDB implementations (Chrome sort of supports it, it is unclear from version to version).
This assumes your typical query plan is that you need to do something like list the orders for a client, and that is the most frequently used type of query.
If you needed to do something across orders, independent of which client an order belongs to, this would not work so well and you would have to iterate over the entire store.
If the clients-orders relation is many to many, then this also would not work so well, because of the need to store the order info redundantly per client. However, one note here: this redundant storage is quite common in NoSQL-style databases like indexedDB. The goal is not to perfectly model the data, but to store the data in such a way that your most frequently occurring queries complete quickly (while still maintaining correctness).
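As a rough sketch of that single-store shape (database, store, and field names are illustrative only):

// One object store, 'clients', keyed by client id; orders and files are
// embedded arrays on the client record.
const openRequest = indexedDB.open('crm', 1);
openRequest.onupgradeneeded = function (event) {
  const db = event.target.result;
  db.createObjectStore('clients', { keyPath: 'id' });
};
openRequest.onsuccess = function (event) {
  const db = event.target.result;
  const tx = db.transaction('clients', 'readwrite');
  tx.objectStore('clients').put({
    id: 1,
    name: 'XY',
    address: 'Some street 1',
    orders: [{ orderId: 10, date: '2023-01-01', items: ['a', 'b'] }],
    files: [] // blobs would go here, subject to browser blob support
  });
};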
Edit:
Based on your edit, I would suggest a simple prototype that uses three object stores. In your client view page where you display client details, simply run three separate queries.
1. Get the one entity from the client object store based on client id.
2. Open a cursor over the orders store and get all orders for the client. In the orders store, use a client-id property. Create an index on this client-id property. Open the cursor over the index for a specific client id.
3. Open a cursor over the files store using a similar tactic as #2.
In your bizlogic layer, enforce your data constraints. For example, when deleting a client, first delete all the files from the files store, then delete all the orders from the orders store, and then delete the single client entity from the client store.
What I am suggesting is to not overthink it. It is not that complicated. So far you have not described something that sounds like it will have performance issues so there is no need for something more elegant.
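A hedged sketch of the three-store prototype described above (database, store, and index names are my own; getAll is used instead of an explicit cursor for brevity):

const openRequest = indexedDB.open('crm', 1);
openRequest.onupgradeneeded = function (event) {
  const db = event.target.result;
  db.createObjectStore('clients', { keyPath: 'id' });
  const orders = db.createObjectStore('orders', { keyPath: 'id' });
  orders.createIndex('clientId', 'clientId');
  const files = db.createObjectStore('files', { keyPath: 'id' });
  files.createIndex('clientId', 'clientId');
};
openRequest.onsuccess = function (event) {
  const db = event.target.result;
  const tx = db.transaction(['clients', 'orders', 'files'], 'readonly');
  const clientId = 1; // illustrative id

  // 1. The single client entity.
  tx.objectStore('clients').get(clientId).onsuccess = function (e) {
    console.log('client', e.target.result);
  };
  // 2. All orders for that client, via the clientId index.
  tx.objectStore('orders').index('clientId').getAll(clientId).onsuccess = function (e) {
    console.log('orders', e.target.result);
  };
  // 3. All files for that client, same tactic.
  tx.objectStore('files').index('clientId').getAll(clientId).onsuccess = function (e) {
    console.log('files', e.target.result);
  };
};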
I will go with Josh's answer, but if you are still finding it hard to use IndexedDB and want to continue using SQL, you can use sqlweb - it will let you run operations inside IndexedDB by using SQL queries.
e.g. -
var connection = new JsStore.Instance('jsstore worker path');
connection.runSql("select * from Customers").then(function(result) {
    console.log(result);
});
Here is the link - http://jsstore.net/tutorial/sqlweb/

External data (keys) mapping within own database

I am currently working on a project for a new system.
This system will be using several different web-services to produce composite data.
Some data is compound and I use relational SQL tables (server, in particular, is MySQL) to compose data for further usage.
My problem is that I have to implement some data mapping.
Take countries for example.
Within our system, countries are keyed (the primary key is based on a CHAR(2) ASCII column) by ISO 3166-1 alpha-2 code.
One web-service provides data in the very same format, while several others have their own unique integer identifiers.
As I am about to implement in-code mapping, I would like to have a possibility to dynamically update mapping tables, without making changes to the code.
Thus I am thinking about mappings table.
I may create a table service_mappings that would contain columns such as service_id (my own identifier for the particular service), ref_id (the datum provided by the web-service), model (what data I am mapping this to in my system), and key (what key this [service_id, ref_id] pair corresponds to in my model).
On the other hand, I may choose something like a mapping table for each separate model, which would contain fewer columns (model is dropped from the previous design, as it would be implied by the table name). This could be more feasible to use with ORMs of some kind.
So, my question is as follows: what is the correct approach, which is the most efficient, and maybe there is some completely different technique?
Cache hint
In response to the recent answer by Alexey:
We are likely to use some caching technique (such as memcache), although for the primary data source we would like to rely on MySQL, as we have methods in place for creating and restoring backups, and with a NoSQL store we would have to think about how to implement them.
Also, MySQL offers some methods for faster access, and according to research by DeNA it may actually be faster than NoSQL alternatives on primary-key/unique-key look-ups.
In your case, I would keep model in the mapping table instead of creating separate tables, because it will be much easier to find the proper mapping. If you want something more efficient, you may use some NoSQL storage for this mapping (such as Redis or memcachedb), which is often much faster and reliable.
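A minimal sketch of the single generic mapping table approach, assuming a Node.js service using the mysql2 library (the table and column names follow the service_mappings idea from the question):

const mysql = require('mysql2/promise');

// Resolve an external identifier from a given web-service to our own key
// for a given model (e.g. a service's country id 840 -> 'US' for 'country').
async function resolveKey(pool, serviceId, model, refId) {
  const [rows] = await pool.execute(
    'SELECT `key` FROM service_mappings WHERE service_id = ? AND model = ? AND ref_id = ?',
    [serviceId, model, refId]
  );
  return rows.length ? rows[0].key : null;
}

async function main() {
  const pool = mysql.createPool({ host: 'localhost', user: 'app', password: 'secret', database: 'app' });
  console.log(await resolveKey(pool, 3, 'country', '840'));
  await pool.end();
}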