Couchbase internals - couchbase

I have a question about data fetching approaches.
First Approach:
Let's say I have two documents.
userdoc1
{
  "status": "pending",
  "usertype": "VIP",
  "userid": "123"
}
For the above document, let's say my document ID is status::usertype (to clarify, this document ID will be unique in our case).
userdoc2
{
  "userid": "123",
  "fname": "abc",
  "lname": "xyz",
  "age": 20,
  "address": "asdf"
}
For userdoc2, let's say userid is my document ID.
If I do a get operation, I would proceed like this (the idea here is to fetch data based on the document ID):
SELECT userid FROM userdoc1 USE KEYS "pending::VIP";
and then
SELECT * FROM userdoc2 USE KEYS "123";
Second Approach:
I have only one document
userdoc
{
  "status": "pending",
  "usertype": "VIP",
  "userid": "123",
  "fname": "abc",
  "lname": "xyz",
  "age": 20,
  "address": "asdf"
}
Here, the document ID is "status::usertype"
and we have a secondary index on userid.
Here I would get the data like this (the idea is to fetch data based on the secondary index):
SELECT * FROM userdoc WHERE userid = "123";
Could you please explain which approach will give higher read performance, assuming a high data load with hundreds of nodes in a cluster, XDCR, and other factors?

Option 1 is going to have two roundtrips from the client to the server to run two cheap queries. Option 2 is going to have one roundtrip from the client to the server to run one slightly more expensive query.
I can't be completely confident without measuring, but I would bet my money on option 2. Roundtrip costs can be a bitch.
Be sure to use a proper index on userid for option 2 and use a prepared query with the userid as a parameter. That should be the fastest option.

The dominant factor (as Johan Larson says in his answer) is likely to be the roundtrip count. Your first solution will have two roundtrips from the application to the cluster, while the second will have only one. There are some potential nuances, though.
An important point to note is that key/value retrievals are always going to be fastest. Those requests will go directly to nodes running the data service. With Couchbase, the clients access the node containing the data directly, not via a master-slave arrangement. In other words, you can fulfill a k/v request with a single round trip only involving the data node that has the actual document.
Using your first approach, you can avoid N1QL entirely. Just do a straight k/v get with id status::usertype, pull out the userid, then use that to do a get of the second document. You could even use the subdocument API to only return the userid.
The second approach will involve an index, and a N1QL query, so you're hitting potentially up to three different machines in your cluster. Whether this will be faster depends on topology. If your application is running alongside your cluster (meaning the network throughput/latency is similar to the intracluster times), I think the k/v approach could actually be faster. If the network latency from app to cluster is longer, the second approach is likely faster.
There's a further consideration. If the entire result is "covered" by the index you create for the query (meaning you store all the parts of the document you care about in your index) then the response can be provided entirely by the index service. This would cut the N1QL approach down to hitting the query service and the index service, which will be faster.
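The trade-off described above can be sketched as a back-of-the-envelope latency model. This is only an illustration under stated assumptions: the function names and all latency figures are hypothetical, service processing time is ignored, and the N1QL path is assumed to make one intra-cluster hop to the index service plus one to the data service (the data hop is skipped when the index covers the query).

```python
def kv_two_gets_ms(app_rtt_ms: float) -> float:
    """Approach 1: two sequential key/value gets, each one app-to-cluster round trip."""
    return 2 * app_rtt_ms


def n1ql_query_ms(app_rtt_ms: float, intra_rtt_ms: float, covered: bool = False) -> float:
    """Approach 2: one app-to-cluster round trip plus hops inside the cluster.

    Assumed hops: query service -> index service, then query service -> data
    service to fetch the document. A covering index skips the second hop.
    """
    internal_hops = 1 if covered else 2
    return app_rtt_ms + internal_hops * intra_rtt_ms


# Illustrative numbers only (assumptions, not measurements).
colocated = {"app_rtt_ms": 0.2, "intra_rtt_ms": 0.2}
remote = {"app_rtt_ms": 5.0, "intra_rtt_ms": 0.2}

for name, net in (("co-located app", colocated), ("remote app", remote)):
    kv = kv_two_gets_ms(net["app_rtt_ms"])
    n1ql = n1ql_query_ms(**net)
    covered = n1ql_query_ms(**net, covered=True)
    print(f"{name}: k/v={kv:.1f} ms, N1QL={n1ql:.1f} ms, covered N1QL={covered:.1f} ms")
```

With these made-up figures the two k/v gets win when the application sits next to the cluster, and the single query wins once the app-to-cluster round trip dominates. Real numbers have to come from measurement.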
To go into a little more detail, your question involves data, indexing, and query. Couchbase splits these functions into separate services, meaning you can scale each capacity independently. That's also why you can be hitting three different machines with the N1QL query.
It will also depend on the nature of the data load. For example, if it's read-heavy vs. write-heavy. Write-heavy with an index will mean index updates, whereas read-heavy won't. Similarly, XDCR will be affected by read vs. write.

Related

Shard key with mostly even distribution. How to handle outliers?

I'm learning about sharding approaches, specifically how to achieve good horizontal scalability with a large number of shards in an IO-heavy application. Below I describe a case that I expect to see in my app. I think this would be relatively common in the wild; however, I was unable to find much info on it.
Let's say that we need to shard a table/collection where each row is associated with a client. All queries will include a single client id (uuid). Updates and reads are mostly evenly distributed among clients.
From what I've read in this case I would want to use a hashed sharding key on the client id. Reads would touch a single shard providing best performance. Writes would be evenly distributed as long as clients produce relatively the same load.
But what to do if there is a very small subset of clients that produce so much IO load that a single shard would have trouble handling it?
If we change the sharding key to a random record ID, then writes for all clients would be distributed across all shards. But reads would have to hit all shards, which is not efficient, especially when there are a lot of them.
How do we achieve a balance: have average clients be evenly distributed, and at the same time allow large clients to occupy multiple shards? Are there any DB solutions that would be able to do this automatically? Or do we have to write custom logic for tracking DB load and redistributing large clients between shards? What should I read on the topic?
I'd suggest adding a new attribute to the client's records, for example we could call it part. Assign a single value to simple clients, and store the same value in part for all their records.
But heavy clients would be assigned multiple values for part, up to the number of shards. Every record for that client would set its part to one of these values. Assign them either randomly or round-robin, however you think is most efficient. The point being to use each part with approximately even frequency.
Your hashing algorithm for mapping clients to a shard would then use the client id + the part attribute. So each simple client would still store all their data on a single shard. But heavy clients will distribute their data over multiple shards.
This does mean that for the heavy clients, a read query would need to search multiple shards. Code your searches to loop over the part values for the client. For most clients, this loop will only need to execute once. For the heavy clients, the loop will execute once for each part value associated with that client.
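The scheme above can be sketched roughly as follows. Everything here is an assumption made for illustration: the shard count, the `HEAVY_CLIENT_PARTS` registry, and the function names are hypothetical, and SHA-256 is just one convenient stable hash.

```python
import hashlib
from typing import List

NUM_SHARDS = 16  # hypothetical cluster size

# Hypothetical registry: heavy clients get several part values, everyone else gets one.
HEAVY_CLIENT_PARTS = {"big-client": 4}


def parts_for(client_id: str) -> range:
    """Part values assigned to a client: 0..n-1 for heavy clients, just 0 otherwise."""
    return range(HEAVY_CLIENT_PARTS.get(client_id, 1))


def shard_for(client_id: str, part: int) -> int:
    """Hash client id + part to a shard number (stable across processes)."""
    digest = hashlib.sha256(f"{client_id}:{part}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shard_for_write(client_id: str, record_seq: int) -> int:
    """Round-robin the part value over the client's parts, then route the write."""
    part = record_seq % len(parts_for(client_id))
    return shard_for(client_id, part)


def shards_for_read(client_id: str) -> List[int]:
    """Shards a read must visit: one per part value (deduplicated, sorted)."""
    return sorted({shard_for(client_id, part) for part in parts_for(client_id)})
```

A simple client always resolves to a single shard, while a heavy client fans out to at most as many shards as it has part values. Two parts can hash to the same shard, so the fan-out may be smaller than the part count.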
To be honest, I've never seen a load so great that this would be necessary. It's more likely that the traffic for one client is too much for one database instance because the queries are not optimized well, or the application is running more queries than it should. It's important to make sure you analyze query efficiency before you make your sharding architecture more complex.
You've tagged your question with cockroachdb so you probably already suspect this, but CockroachDB handles sharding transparently. If your primary key is composite and the first column is the client id, data with the same client id will all fall in a contiguous key range, and therefore be generally stored on the same node. If a range gets bigger than a configurable limit, and/or gets much more traffic, CockroachDB will automatically split the range to rebalance storage and traffic across nodes. You'll mostly not have to pay attention to this, and for your pattern you won't want to do any explicit sharding. However, if you do need to inspect or tweak the behavior there are tools to do so such as SHOW RANGES.

Cosmos DB : Faster Search Options

We have a huge Cosmos DB container with billions of rows and almost 300 columns. The data is partitioned and modeled for the way we query it most of the time.
For example, the User table is partitioned by userId, which is why the query below works fine.
Select * from User where userId = "user01234"
But in some cases we need to query the data differently, filtering on another property and sorting the result.
For example, get data from the User table using the user post ID and the date of the post:
Select * from User where userPostId = "P01234" order by date limit 100
This query takes a lot of time because of the size of the data and because the data is not partitioned for query 2 (by user post).
My question is: how can we make query 2 and other similar queries faster when the data is not partitioned accordingly?
Option 1: "Create a separate collection that is partitioned for query 2."
This will make the query faster, but for any new query we will end up creating a new collection, which duplicates billions of records. [Costly option]
Option 2: "Build Elasticsearch on top of the DB?" This is a time-consuming option and may be overkill for this slow query problem.
Is there any other option that can be used? Let me know your thoughts.
Thanks in advance!
Both options are expensive. The key is deciding which is cheaper, including just running the cross-partition query. This will require you to cost out each of these options.
For the cross-partition query, capture the RU charge in the response object so you know the cost of it.
For change feed (used to populate a second collection partitioned for your second query and keep it in sync), there will be an upfront cost as you run it over your existing collection, but whether that cost remains high depends on how much data is inserted or updated each month. Calculating the cost to populate your second collection will take some work. You can start by measuring the RU charge in the response object when doing an insert, then multiply by the number of rows. How much throughput you'll need is a function of how quickly you want to populate your second collection. It's also a function of how much compute and how many instances you use to read and write the data to the second collection.
Once the second collection is populated, Change Feed will cost 2 RU/s to poll for changes (btw, this is configurable) and 1 RU/s to read each new item. Inserting data into the second collection costs whatever you measured earlier.
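The populate-cost arithmetic described above (RU charge per insert times the number of rows, then divided by the time you are willing to wait) can be sketched like this. The function name and the input figures are hypothetical placeholders, and the sketch counts only the insert side, not the RUs spent reading the source collection.

```python
def populate_cost(ru_per_insert: float, row_count: int, hours_to_finish: float):
    """Return (total RU to insert every row, RU/s needed to finish in the given time)."""
    total_ru = ru_per_insert * row_count
    required_ru_per_s = total_ru / (hours_to_finish * 3600)
    return total_ru, required_ru_per_s


# Hypothetical inputs: 10 RU measured per insert, 8.64 million rows, done within 24 hours.
total, throughput = populate_cost(ru_per_insert=10, row_count=8_640_000, hours_to_finish=24)
print(f"total: {total:,.0f} RU, sustained throughput needed: {throughput:,.0f} RU/s")
```

Plugging in your own measured RU charge and real row count gives a first estimate to compare against the cost of simply running the cross-partition query.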
If this second query doesn't get run that often and your data doesn't change that much, then change feed could save you money. If you run this query a lot and your data changes frequently too, change feed could still save you money.
With regards to Elastic Search or Azure Search, I generally find this can be more expensive than keeping the cross-partition query or change feed. Especially if you're doing it to just answer a second query. Generally this is a better option when you need true free text query capabilities.
A third option you might explore is using Azure Synapse Link and then run both queries using SQL Serverless or Spark.
Some other observations.
Unless you need all 300 properties in these queries you run, you may want to consider shredding these items into separate documents and storing as separate rows. Especially if you have highly asymmetric update patterns where only a small number of properties get frequently updated. This will save you a ton of money on updates because the smaller the item you update, the cheaper (and faster) it will be.
The other thing I would suggest is to look at your index policy and exclude every property that is not used in the where clause for your queries and include properties that are. This will have a dramatic impact on RU consumption for inserts. Also take a look at composite index for your date property as this has a dramatic impact on queries that use order by.
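As a rough illustration of that index-policy advice, the sketch below builds a policy document that excludes every path by default, includes only the properties used in filters, and adds a composite index. The helper name is hypothetical, and the JSON shape reflects my understanding of the Cosmos DB indexing policy format, so check it against the official documentation before applying it. Whether a particular ORDER BY query can use the composite index also depends on the query shape.

```python
import json


def build_indexing_policy(filter_paths, composite_paths):
    """Index only the listed property paths and add one composite index over composite_paths."""
    return {
        "indexingMode": "consistent",
        # Only properties that appear in WHERE clauses get indexed.
        "includedPaths": [{"path": f"/{p}/?"} for p in filter_paths],
        # Everything else is excluded, which lowers the RU charge of writes.
        "excludedPaths": [{"path": "/*"}],
        "compositeIndexes": [
            [{"path": f"/{p}", "order": "ascending"} for p in composite_paths]
        ],
    }


# Property names taken from the question; adjust to your own schema.
policy = build_indexing_policy(["userPostId", "date"], ["userPostId", "date"])
print(json.dumps(policy, indent=2))
```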

How to keep normalized models when searching via ElasticSearch?

When setting up a MySQL / ElasticSearch combo, is it better to:
1. Completely sync all model information to ES (even the non-search data), so that when a result is found, I have all its information handy.
2. Only sync the searchable fields, and then when I get the results back, use the id field to find the actual data in the MySQL database?
The Elasticsearch model of data prefers non-normalized data, usually. Depending on the use case (large amount of data, underpowered machines, too few nodes etc) keeping relationships in ES (parent-child) to mimic the inner joins and the like from the RDB world is expensive.
Your question is very open-ended and the answer depends on the use-case. Generally speaking:
- avoid mimicking the exact DB tables as ES indices plus their relationships
- the advantage of keeping everything in ES is that, once a result is found, you don't need a second lookup in the database
- if your searchable data is very small compared to the overall amount of data, I don't see why you couldn't synchronize just the searchable data with ES
- try to flatten the data in ES and resist any impulse to use parent/child just because this is how it's done in MySQL
I'm not saying you cannot use parent/child. You can, but make sure you test this before adopting this approach and make sure you are OK with the response times. This is, anyway, valid advice for any approach you choose.
Elasticsearch is a search engine. I would advise you not to use it as a database system. I suggest you index only the search data and a unique id from your database, so that you can retrieve the results from MySQL using the unique key returned by Elasticsearch.
This way you'll be using both applications for what they're intended. Elasticsearch is not the best for querying relations, and you'll have to write a lot more code to operate on related data than if you simply used MySQL for it.
Also, you don't want to tie your persistence layer to your search layer. These should be as independent as possible, and a change in one should not affect the other, as much as possible. Otherwise, you'll have to update both your systems if either has to change.
Querying MySQL on some IDs is very fast, so you can use it and leave the slow part (querying on full text) to Elasticsearch.
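The two-step flow recommended above (search returns ids, the database supplies the full rows) can be sketched as follows. To keep the example self-contained, SQLite stands in for MySQL and a plain dictionary stands in for the Elasticsearch index; the table, data, and function names are all made up for illustration.

```python
import sqlite3

# Stand-in for MySQL: the system of record holds the full rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER)")
db.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [(1, "red kettle", 25.0, 3), (2, "blue kettle", 30.0, 0), (3, "red mug", 8.0, 12)],
)

# Stand-in for Elasticsearch: only the searchable text and the id are indexed.
search_index = {1: "red kettle", 2: "blue kettle", 3: "red mug"}


def search_ids(term):
    """Return ids of matching documents (a real engine would also rank them)."""
    return [doc_id for doc_id, text in search_index.items() if term in text.split()]


def hydrate(ids):
    """Fetch full rows by primary key, keeping the order the search returned."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        f"SELECT id, name, price, stock FROM products WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


results = hydrate(search_ids("kettle"))
print(results)
```

Skipping ids that are missing from the database is a deliberate choice here: the search index can briefly lag behind deletes, so the database remains the source of truth.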
Although it depends on the situation, I would suggest you go with #2:
- Faster indexing: we only fetch the searchable data from the DB and index it into ES, compared to fetching and indexing everything.
- Smaller storage size: since the indexed data is smaller than with #1, it's easier to back up, restore, recover, and upgrade your ES in production. It also keeps your storage small as your data grows, and you can consider using SSDs to improve performance at lower cost.
- In general, a search app searches on some fields and shows all available data to the user, e.g. searching for products but showing pricing and stock info on the result page, which is only available in the DB. So it's natural to have a second step that queries the DB for the extra info and combines it with the search results for display.
Hope it helps.

What is the algorithm for query search in the database?

Good day everyone, I'm currently doing research on search algorithm optimization.
As of now, I'm researching databases.
In a database with SQL support, I can write queries against a specific table:
1. Select Number from Table1 where Name = "Test";
2. Select * from Table1 where Name = "Test";
Query 1 fetches the Number column from Table1 for rows where Name is Test, and query 2 fetches all columns for rows where Name is Test.
I understand what the queries do; what I'm interested in learning is how the search itself is carried out.
Is it just a plain linear search, scanning from the first row to the nth and grabbing every row for which the condition is true, and thus taking O(n) time, or does it use a specialized algorithm that speeds up the process?
If there are no indexes, then yes, a linear search is performed.
But databases typically use a B-tree index when you specify a column (or columns) as a key. These are special data structures that are specifically tuned (with high branching factors) to perform well on magnetic disk hardware, where the most significant time-consuming factor is the seek operation (the magnetic head has to move to a different part of the file).
You can think of the index as a sorted/structured copy of the values in a column. It can be determined quickly whether the value being searched for is in the index. If the database finds it, it will also find a pointer back to the location of the corresponding row in the main data file (so it can go and read the other columns in the row). Sometimes a multi-column index contains all the data requested by the query, and then the database doesn't need to go back to the main file; it can just read what it found and then it's done.
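A toy version of that idea is sketched below. A flat sorted list with binary search stands in for the B-tree (a real index is a paged tree on disk, not an in-memory list), and the table contents are made up.

```python
import bisect

# The "main data file": rows in arbitrary (insertion) order.
table = [
    {"Name": "Zed", "Number": 7},
    {"Name": "Test", "Number": 42},
    {"Name": "Amy", "Number": 3},
    {"Name": "Test", "Number": 99},
]

# The index: a sorted copy of the Name column, each entry pointing at a row position.
name_index = sorted((row["Name"], pos) for pos, row in enumerate(table))


def linear_lookup(name):
    """No index: inspect every row, O(n)."""
    return [row["Number"] for row in table if row["Name"] == name]


def index_lookup(name):
    """With the index: binary search for the first match, O(log n), then follow pointers."""
    i = bisect.bisect_left(name_index, (name, -1))
    numbers = []
    while i < len(name_index) and name_index[i][0] == name:
        numbers.append(table[name_index[i][1]]["Number"])
        i += 1
    return numbers
```

Both functions return the same rows for `Name = "Test"`; the difference is how many entries each one has to look at to find them.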
There are other types of indexes, but I think you get the idea: duplicate the data and arrange it in a way that's fast to search.
On a large database, indexes make the difference between waiting a fraction of a second and possibly days for a complex query to complete.
By the way, B-trees aren't a simple, easy-to-understand data structure, and the traversal algorithm is also complex. In addition, the traversal code is even uglier than most of the code you will find, because a database is constantly loading and unloading chunks of data from disk and managing them in memory. But if you're familiar with binary search trees, then I think you understand the concept well enough.
Well, it depends on how the data is stored and what you are trying to do.
As already indicated, a common structure for maintaining entries is a B+ tree. The tree is well optimized for disk since the actual data is stored only in the leaves, while the keys are stored in the internal nodes. It usually allows a very small number of disk accesses, since the top k levels of the tree can be kept in RAM, and only the few bottom levels are stored on disk and require a disk read each.
Another alternative is a hash table. You maintain in memory (RAM) an array of "pointers". Each pointer indicates a disk address, which contains a bucket that includes all entries with the corresponding hash value. Using this method, you only need O(1) disk accesses (which is usually the bottleneck when dealing with databases), so it should be relatively fast.
However, a hash table does not allow efficient range queries (which can be efficiently done in a B+ tree).
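The difference shows up clearly in a small in-memory sketch. A dictionary plays the hash table and a sorted list plays the ordered leaf level of a B+ tree; the data and function names are invented for illustration.

```python
import bisect
from collections import defaultdict

rows = [(17, "q"), (3, "a"), (42, "z"), (8, "c"), (23, "m")]  # (id, payload)

# Hash index: id -> row positions. One probe answers an equality search.
hash_index = defaultdict(list)
for pos, (row_id, _) in enumerate(rows):
    hash_index[row_id].append(pos)

# Ordered index (stand-in for the B+ tree leaf level): ids kept sorted.
sorted_ids = sorted((row_id, pos) for pos, (row_id, _) in enumerate(rows))


def point_lookup(row_id):
    """Equality search through the hash index."""
    return [rows[pos] for pos in hash_index.get(row_id, [])]


def range_via_hash(lo, hi):
    """Hash order is unrelated to key order, so every key must be checked."""
    return sorted(
        rows[pos]
        for key, positions in hash_index.items()
        if lo <= key <= hi
        for pos in positions
    )


def range_via_sorted(lo, hi):
    """Binary search both ends, then read one contiguous slice."""
    start = bisect.bisect_left(sorted_ids, (lo, -1))
    end = bisect.bisect_right(sorted_ids, (hi, len(rows)))
    return [rows[pos] for _, pos in sorted_ids[start:end]]
```

Both range functions return the same rows, but the hash version touches every key while the sorted version touches only the matching slice.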
The disadvantage of all of the above is that it requires a single key - i.e. if the hash table or B+ tree is built according to the field "id" of the relation, and then you search according to "key" - it becomes useless.
If you want to guarantee fast search for all fields of the relation - you are going to need several structures, each according to a different key - which is not very memory efficient.
Now, there are many optimizations to be considered according to the specific usage. If, for example, the number of searches is expected to be very small (say, fewer than log log N of the total operations), maintaining a B+ tree is overall less efficient than just storing the elements as a list and, on the rare occasion of a search, doing a linear search.
Very good question, but it can have many answers depending on the structure of your table and how it is normalized...
Usually, to perform a search in a SELECT query, the DBMS sorts the table (it uses mergesort rather than quicksort, because mergesort is well suited to disk I/O); then, depending on the indexes (if the table has any), it just matches the values. If the structure is more complex, the DBMS can perform a search in a tree, but that goes too deep; let me check the notes I took.
I recommend enabling the query execution plan; here is an example of how to do so in SQL Server 2008. Then execute your SELECT statement with the WHERE clause, and you will begin to understand what is going on inside the DBMS.
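The answer refers to SQL Server 2008. As a self-contained stand-in, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` from Python to show the same kind of information for the query from the question, before and after an index is created. The index name and sample rows are made up, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Name TEXT, Number INTEGER)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?)", [("Test", 1), ("Other", 2), ("Test", 3)])

QUERY = "SELECT Number FROM Table1 WHERE Name = 'Test'"


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_without_index = plan(QUERY)  # expected to describe a full table scan
conn.execute("CREATE INDEX idx_table1_name ON Table1 (Name)")
plan_with_index = plan(QUERY)     # expected to mention idx_table1_name

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

Without the index the plan reports a scan of the whole table; with it, the plan reports a search using the index, which is the linear versus indexed distinction discussed in the other answers.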

Basic question: Querying data and performance tradeoffs

Let's say I have 100 rows in my table, with 3 columns of numbers. I don't need all the rows, only about half of them every time I fetch data. I only want the rows that have updated as getting the rest would be redundant.
Is it better to add a datetime field that records when each row was last updated (and use that as a criterion when SELECTing, so I only get rows changed since my last fetch)? Or would it be better to simply download all the data each and every time (currently the data is being sent back as a JSON file)?
What are the tradeoffs in terms of speed, bandwidth usage, and server cpu usage between these two options? Is the former just plain better than the latter?
Both Jens Struwe and roycl are right - but as you're asking a hypothetical question, you're going to get answers that are right and contradictory.
If only half the data is relevant, how is the client going to determine which data to show? If the decision can be made by software at all, it's more efficient to do it on the database - but it's also more logical.
With tables of 100 rows, performance is neither here nor there; maintainability and long-term upgradability is a far bigger deal. Most developers would expect a logical database design, and sorting/filtering to be done on the DB rather than the client.
Always (or at least whenever possible) select only the data that you need to accomplish your task. Conversely, never select data that you then have to filter out. As a result: add a timestamp field for the updates and select only those rows whose timestamp is greater than the given one.
With 100 rows in your table and 3 columns of numbers, it really doesn't matter which approach you use, as long as you are happy with the server returning the data within a few tens of milliseconds. The rows, if queried frequently, will all be in memory anyway. Fetching everything also makes your JSON code simpler and your client code dumber (which is probably good, and more maintainable).
If you had a several-million row table with only a small percentage of data that was required, you would naturally want to limit the return set, and the easiest way of doing that is with an SQL WHERE clause, such as WHERE dt_modified > my_timestamp. On a properly optimised database even this query could come in at well under 100ms.
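A minimal sketch of that timestamp filter, using an in-memory SQLite database so it runs on its own. The table name, the sample rows, and the helper function are made up; `dt_modified` and `my_timestamp` are the names used above. Timestamps are stored as ISO 8601 strings, which compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER, dt_modified TEXT)"
)
# An index on the timestamp keeps the filter cheap as the table grows.
conn.execute("CREATE INDEX idx_readings_modified ON readings (dt_modified)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 20, 30, "2024-01-01T09:00:00"),
        (2, 11, 21, 31, "2024-01-01T12:30:00"),
        (3, 12, 22, 32, "2024-01-02T08:15:00"),
    ],
)


def fetch_changed_since(my_timestamp):
    """Return rows modified after the client's last fetch, oldest change first."""
    cur = conn.execute(
        "SELECT id, a, b, c, dt_modified FROM readings "
        "WHERE dt_modified > ? ORDER BY dt_modified",
        (my_timestamp,),
    )
    return cur.fetchall()


last_fetch = "2024-01-01T10:00:00"
changed = fetch_changed_since(last_fetch)
# The newest timestamp seen becomes the cursor for the next fetch.
next_cursor = changed[-1][4] if changed else last_fetch
```

The client keeps `next_cursor` and passes it as `my_timestamp` on the next request, so each call transfers only the rows that changed in between.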
The issue may have more to do with the time the data spends "on the wire" and how much time the client spends either regenerating the page or updating it based on the returned data. Client processing time is often the slowest part of the process. Only testing on different browsers and over different network speeds will find the best balance between server-side tweaks, network fixes (such as gzipping to compress data), and optimising your JavaScript calls.