How to correctly calculate RAM requirement for a bucket in Couchbase - couchbase

We have a bucket of about 34 million items in a Couchbase cluster setup of 6 AWS nodes. The bucket has been allocated 32.1GB of RAM (5482MB per node) and is currently using 29.1GB. If I use the formula provided in the Couchbase documentation (http://docs.couchbase.com/admin/admin/Concepts/bp-sizingGuidelines.html) it should use approx. 8.94GB of RAM.
Am I calculating it incorrectly? Below is link to google spreadsheet with all the details.
https://docs.google.com/spreadsheets/d/1b9XQn030TBCurUjv3bkhiHJ_aahepaBmFg_lJQj-EzQ/edit?usp=sharing

Assuming that you indeed have a working set of 0.5%, which as Kirk pointed out in his comment, is odd but not impossible, then you are calculating the result of the memory sizing formula correctly. However, it's important to understand that the formula is not a hard and fast rule that fits all situations. Rather, it's a general guideline and serves as a good starting point for you to go and begin your performance tests. Also, keep in mind that the RAM sizing isn't the only consideration for deciding on cluster size, because you also have to consider data safety, total disk write throughput, network bandwidth, CPU, how much a single node failure affects the rest of the cluster, and more.
Using the result of the RAM sizing formula as a starting point, you should now actually test whether your working assumptions were correct. Which means putting real (or close to representative) load on the bucket and seeing whether the % of cache misses is low enough and the operation lacency is within your acceptable limits. There is no general rule for this, what's acceptable to some applications might be too slow for others.
Just as an example, if you see that under load your cache miss ratio is 5% and while the average read latency is 3ms, the top 1% latency is 100ms - then you have to consider whether having one out of every 100 reads take that much longer is acceptable in your application. If it is - great, if not - you need to start increasing the RAM size until it matches your actual working set. Similarly, you should keep an eye on the disk throughput, CPU usage, etc.

Related

How can I reduce compute costs and waste in my Foundry transforms?

We have a lot of pretty complicated data pipelines and the amount of compute being consumed has been steadily rising every month. How can I figure out where compute is being wasted and make things more efficient?
So, this will turn into a little bit of an involved answer but hopefully I can point people to a useful set of resources to help them manage waste.
Let's start in the obvious place. Compute profiles:
Engineers will commonly increase the executor memory to solve an executor OOM, the cause of this OOM is often skew. Try to mitigate the skew first and increase memory usage second.
Memory is relatively cheap, but when you increase memory you do so on every executor, which can get expensive across a large number of executors. Usually only a single executor is OOMing and 90% of the time it is due to skew.
Local Spark: You can use the compute profile KUBERNETES_NO_EXECUTORS on small transforms (a rule of thumb might be <50mb of input and output data) which will mean your transform will be run on the driver (reminder on drivers vs executors) This will mean 2 fewer modules are spun up reducing the amount of resources consumed by 66%. Often a job this small does not need executors and using them just causes shuffles and other wasted compute. When you're dealing with small data try to use local spark, your jobs will spin up faster, and will cost less.
Views: Docs on views have not been added to public docs yet, but you can find them on your platform docs at documentation/product/views/overview.
Views are a really useful way to reduce compute usage by eliminating the need for a transform altogether. Anywhere you have an identity transform being used to move a dataset between projects, or a transform that exists only to union several other datasets together, this transform can be replaced by a view. Views work by containing the information on the backing datasets and files, rather than containing any files themselves. They therefore require no processing of their own.
Incremental Pipelines: Where you have data that does not need to be changed after it is processed you might be able to use an incremental pipeline. This way you only process the new data as it comes into your pipeline without having to reprocess the entire mass of data.
This is probably the most powerful tool to reduce compute consumption in large intensive pipelines with high data throughput.

Couchbase compression free disk space requirement

According to old documentation:
If the amount of available disk space is less than twice the current database size, the compaction process does not take place and a warning is issued in the log.
Assuming this is still relevant for Couchbase 5.x (I couldn't find it in the latest docs), I'd like to know whether this requirement is truly for the entire bucket size (or even the entire database) - or rather per vBucket that's being compacted at a given point in time (since the compaction process happens per vBucket, with only 3 working in parallel by default).
If it's per compacting vBucket, I'd be less worried about having my single bucket take more than 50% of disk size, which right now I'm wary of and so I keep a very large margin of disk unutilized.
This questions was also asked on the Couchbase forums. I have copied my answer from there:
Assuming this is still relevant for Couchbase 5.x
This is still correct for Couchbase Server 5.X.
The recommendation is on the size of the whole bucket as you noted.
The calculation itself to check if compaction has enough space to complete successfully is done per vBucket. In other words compaction will run as long as there is enough space for that one vBucket to be compacted.
since the compaction process happens per vBucket, with only 3 working in parallel by default).
The default setting has been changed to one - for more details see MB-18426
I can't seem to find documentation in 5.x either, but again, assuming that old documentation still holds true, I think it's probably at the vbucket level as you suspect (some notes from Don Pinto in this old blog post seem to confirm). However, while documents are distributed relatively evenly between vbuckets, the actual size can vary, so I wouldn't make any assumptions about vbucket size if you're looking at disk space. Indexes also take up disk space.
But also note that if you're concerned about running into disk size limits, you can add another node, which will redistribute the vbuckets, and should free up more space on every node.

Does increase in number of mining pool decrease the block generation time

Can any one clarify whether (or, not) an increase in mining pool in Ethereum network with decrease the average block generation time? (For example, if another pool like ethermine join the network today and start mining). Since all the pools are competing with each other, I am getting confused
No, block generation times are driven by the current difficulty for solving the the algorithm used in the Proof of Work model. Only when a solution is found, is the block accepted to the chain and the difficulty determines how long it will take to find that solution. This difficulty automatically adjusts to speed up or slow down block generation times.
From the mining section of the Ethereum wiki:
The proof of work algorithm used is called Ethash (a modified version of Dagger-Hashimoto) involves finding a nonce input to the algorithm so that the result is below a certain threshold depending on the difficulty. The point in PoW algorithms is that there is no better strategy to find such a nonce than enumerating the possibilities while verification of a solution is trivial and cheap. If outputs have a uniform distribution, then we can guarantee that on average the time needed to find a nonce depends on the difficulty threshold, making it possible to control the time of finding a new block just by manipulating difficulty.
The difficulty dynamically adjusts so that on average one block is produced by the entire network every 12 seconds.
(Note that the current block generation time is closer to 15 seconds. You can find the block generation times on Etherscan)

Replication Factor to use for system_auth

When using internal security with Cassandra, what replication factor do you use for system_auth?
The older docs seem to suggest it should be N, where N is the number of nodes, while the newer ones suggest we should set it to a number greater than 1. I can understand why it makes sense for it to be higher - if a partition occurs and one section doesn't have a replica, nobody can log in.
However, does it need to be all nodes? What are the downsides of setting it to all ndoes?
Let me answer this question by posing another:
If (due to some unforseen event) all of your nodes went down, except for one; would you still want to be able to log into (and use) that node?
This is why I actually do ensure that my system_auth keyspace replicates to all of my nodes. You can't predict node failure, and in the interests of keeping your application running, it's better safe than sorry.
I don't see any glaring downsides in doing so. The system_auth keyspace isn't very big (mine is 20kb) so it doesn't take up a lot of space. The only possible scenario, would be if one of the nodes is down, and a write operation is made to a column family in system_auth (in which case, I think the write gets rejected...depending on your write consistency). Either way system_auth isn't a very write-heavy keyspace. So you'll be ok as long as you don't plan on performing user maintenance during a node failure.
Setting the replication factor of system_auth to the number of nodes in the cluster should be ok. At the very least, I would say you should make sure it replicates to all of your data centers.
In case you were still wondering about this part of your question:
The older docs seem to suggest it should be N where n is the number of nodes, while the newer ones suggest we should set it to a number greater than 1."
I stumbled across this today in the 2.1 documentation Configuring Authentication:
Increase the replication factor for the system_auth keyspace to N
(number of nodes).
Just making sure that recommendation was clear.
Addendum 20181108
So I originally answered this back when the largest cluster I had ever managed had 10 nodes. Four years later, after spending three of those managing large (100+) node clusters for a major retailer, my opinions on this have changed somewhat. I can say that I no longer agree with this statement of mine from four years ago.
This is why I actually do ensure that my system_auth keyspace replicates to all of my nodes.
A few times on mind-to-large (20-50 nodes) clusters , we have deployed system_auth with RFs as high as 8. It works as long as you're not moving nodes around, and assumes that the default cassandra/cassandra user is no longer in-play.
The drawbacks were seen on clusters which have a tendency to fluctuate in size. Of course, clusters which change in size usually do so because of high throughput requirements across multiple providers, further complicating things. We noticed that occasionally, the application teams would report auth failures on such clusters. We were able to quickly rectify these situation, by running a SELECT COUNT(*) on all system_auth tables, thus forcing a read repair. But the issue would tend to resurface the next time we added/removed several nodes.
Due to issues that can happen with larger clusters which fluctuate in size, we now treat system_auth like we do any other keyspace. That is, we set system_auth's RF to 3 in each DC.
That seems to work really well. It gets you around the problems that come with having too many replicas to manage in a high-throughput, dynamic environment. After all, if RF=3 is good enough for your application's data, it's probably good enough for system_auth.
The reason the recommendation changed was that a quorum query would require responses from over half of your nodes to fullfill. So if you accidentally left Cassandra user active and you have 80 nodes - we need 41 responses.
Whilst it's good practice to avoid using the super user like that - you'd be surprised how often it's still out there.

TCP Slow Start, Congestion Avoidance & Determining Bandwidth

Is there a formula someplace which can be used to determine the minimum number of segments / bytes which need to be transfered across a TCP connection to determine it's bandwidth and which takes into account Slow Start and Congestion Avoidance? I'm aware of the pathrate tool, but I want if possible something a bit simpler that I can incorporate in an app to get a descent ballpark figure. One example of usage would be downloading some data from a webserver in order to determine the optimum number of threads for downloading a bunch of small files automatically. This is related to a previous question I posted: TCP, HTTP and the Multi-Threading Sweet Spot
You can fire up scholar.google.com and search for "TCP chirp". However, that requires hires timers, and if you don't write a kernel tcp congestion control algorithm, you'd have to reimplement TCP in userspace. And that by itself will probably not give good results (general purpose OS are not very good at realtime hires timer related stuff, runnning in userspace).
In theory, using TCP chirp you need as few as 4-5 segments (typically, you'd get better resolution with a longer train of segments) to determine the "optimal" bandwidth.
In any case, since you can not know which path is used (ie. satellite link or tv broadcast in the forward direction), you may need a considerable amount of data (10+ MB, perhaps even 1GB) to get a decent measurement over arbitrary paths. (Satellites can have many dozend MB/s bandwidth, but also latencies in the 1000-3000 ms range; and TCP takes a couple round-trip times to open up cwnd (I'd say around 10 RTTs before a measurement should be started)...
I do not think that there is a fixed number of bytes required to be sent to determine the bandwidth. This number can depend on network type and speed.
Bandwidth is a measure of some resource transferred over a time interval. To get real data you need to measure it. Here are some hints how to do that