Couchbase compression free disk space requirement - couchbase

According to old documentation:
If the amount of available disk space is less than twice the current database size, the compaction process does not take place and a warning is issued in the log.
Assuming this is still relevant for Couchbase 5.x (I couldn't find it in the latest docs), I'd like to know whether this requirement is truly for the entire bucket size (or even the entire database) - or rather per vBucket that's being compacted at a given point in time (since the compaction process happens per vBucket, with only 3 working in parallel by default).
If it's per compacting vBucket, I'd be less worried about having my single bucket take more than 50% of disk size, which right now I'm wary of and so I keep a very large margin of disk unutilized.

This question was also asked on the Couchbase forums. I have copied my answer from there:
Assuming this is still relevant for Couchbase 5.x
This is still correct for Couchbase Server 5.X.
The recommendation is on the size of the whole bucket as you noted.
The calculation itself to check if compaction has enough space to complete successfully is done per vBucket. In other words compaction will run as long as there is enough space for that one vBucket to be compacted.
since the compaction process happens per vBucket, with only 3 working in parallel by default
The default setting has since been changed to one; for more details, see MB-18426.
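To make the difference between the whole-bucket recommendation and the per-vBucket check concrete, here is a rough sketch of the arithmetic. It is not Couchbase's actual code: the function name, the factor of 2 taken from the old documentation, and the example sizes are all assumptions for illustration, and real vBuckets are not all the same size.

```python
def can_compact(file_bytes, free_bytes, factor=2):
    """Hypothetical check: free space must be at least `factor` times the file size."""
    return free_bytes >= factor * file_bytes


def demo():
    gib = 1024 ** 3
    bucket_bytes = 600 * gib        # assumed size of the bucket on this node
    free_bytes = 400 * gib          # assumed free disk space
    num_vbuckets = 1024             # assumes all 1024 vBuckets live on one node
    avg_vbucket_bytes = bucket_bytes // num_vbuckets

    whole_bucket_ok = can_compact(bucket_bytes, free_bytes)
    single_vbucket_ok = can_compact(avg_vbucket_bytes, free_bytes)
    return whole_bucket_ok, single_vbucket_ok


print(demo())
```

With these made-up numbers the whole-bucket rule is not satisfied, while a check against one average-sized vBucket passes easily, which is why the recommendation is more conservative than the check itself.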

I can't seem to find documentation in 5.x either, but again, assuming that old documentation still holds true, I think it's probably at the vbucket level as you suspect (some notes from Don Pinto in this old blog post seem to confirm). However, while documents are distributed relatively evenly between vbuckets, the actual size can vary, so I wouldn't make any assumptions about vbucket size if you're looking at disk space. Indexes also take up disk space.
But also note that if you're concerned about running into disk size limits, you can add another node, which will redistribute the vbuckets, and should free up more space on every node.

How to correctly calculate RAM requirement for a bucket in Couchbase

We have a bucket of about 34 million items in a Couchbase cluster setup of 6 AWS nodes. The bucket has been allocated 32.1GB of RAM (5482MB per node) and is currently using 29.1GB. If I use the formula provided in the Couchbase documentation (http://docs.couchbase.com/admin/admin/Concepts/bp-sizingGuidelines.html) it should use approx. 8.94GB of RAM.
Am I calculating it incorrectly? Below is a link to a Google spreadsheet with all the details.
https://docs.google.com/spreadsheets/d/1b9XQn030TBCurUjv3bkhiHJ_aahepaBmFg_lJQj-EzQ/edit?usp=sharing
Assuming that you indeed have a working set of 0.5%, which as Kirk pointed out in his comment, is odd but not impossible, then you are calculating the result of the memory sizing formula correctly. However, it's important to understand that the formula is not a hard and fast rule that fits all situations. Rather, it's a general guideline and serves as a good starting point for you to go and begin your performance tests. Also, keep in mind that the RAM sizing isn't the only consideration for deciding on cluster size, because you also have to consider data safety, total disk write throughput, network bandwidth, CPU, how much a single node failure affects the rest of the cluster, and more.
Using the result of the RAM sizing formula as a starting point, you should now actually test whether your working assumptions were correct. That means putting real (or close to representative) load on the bucket and seeing whether the percentage of cache misses is low enough and the operation latency is within your acceptable limits. There is no general rule for this; what's acceptable to some applications might be too slow for others.
Just as an example, if you see that under load your cache miss ratio is 5% and while the average read latency is 3ms, the top 1% latency is 100ms - then you have to consider whether having one out of every 100 reads take that much longer is acceptable in your application. If it is - great, if not - you need to start increasing the RAM size until it matches your actual working set. Similarly, you should keep an eye on the disk throughput, CPU usage, etc.
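As a sketch of how you might summarize such a load test, the snippet below computes the cache miss ratio, the average latency and a tail percentile from raw samples. The function names and the sample numbers are made up for illustration; in practice you would feed in whatever your load-testing tool or the bucket statistics report.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, cache_misses, total_reads):
    """Collect the numbers discussed above into one dictionary."""
    return {
        "miss_ratio": cache_misses / total_reads,
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Hypothetical run: 98 fast reads served from RAM, 2 slow reads that went to disk.
samples = [2.0] * 98 + [100.0] * 2
report = summarize(samples, cache_misses=5, total_reads=100)
print(report)
```

The point of looking at the percentile as well as the mean is that a handful of slow disk reads barely moves the average but dominates the tail.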

Limiting the size of the Regular Expression Cache in JRuby

We are finding that the Regular Expression Cache in our JRuby application is out of control - it just keeps growing and growing until the app is grinding to a halt.
It eventually does garbage collect, but transaction time becomes far too high (90 secs instead of 1-2 secs) long before that.
Is there a way to either stop this Regexp Cache from growing so much or limit the size of the cache?
First of all, since you already mentioned looking at the source in "Very large retained heap size for org.jruby.RubyRegexp$RegexpCache in JRuby Rails App", you have probably realised there is no such support implemented.
I would say you have 2-3 options to choose from:
Implement support for limiting or completely disabling the cache within JRuby's RubyRegexp.
Introduce a "hack" that checks available memory and clears out some of the caches RubyRegexp keeps, e.g. from another thread (at least until a PR is accepted into JRuby).
Look into tuning or using a different GC (including some JVM options) so that the app performs more predictably. This is application dependent and cannot be answered in general without knowing the specifics.
One hint related to how the JVM keeps soft references: -XX:SoftRefLRUPolicyMSPerMB=250. The default is 1000 (1 second per free megabyte of heap), so decreasing it means soft references live for a shorter time. It might all just come down to when they are collected (which depends on the GC and Java version, I guess), so in the end you might find you are fixing the symptom and not the real cause. As noted, things such as these cannot be generalized, especially knowing very little about the app and/or the JVM options used.
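To illustrate what "limiting the cache" in the first option could look like, here is a small size-bounded cache of compiled patterns with least-recently-used eviction. This is a Python sketch of the general idea only, not JRuby's RubyRegexp code; the class name and the default limit are hypothetical.

```python
import re
from collections import OrderedDict


class BoundedRegexCache:
    """Keeps at most `max_entries` compiled patterns, evicting the least recently used."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, pattern, flags=0):
        key = (pattern, flags)
        compiled = self._entries.get(key)
        if compiled is None:
            compiled = re.compile(pattern, flags)
            self._entries[key] = compiled
            if len(self._entries) > self.max_entries:
                # Drop the oldest entry so the cache can never grow without bound.
                self._entries.popitem(last=False)
        else:
            # Mark the entry as recently used.
            self._entries.move_to_end(key)
        return compiled

    def __len__(self):
        return len(self._entries)


cache = BoundedRegexCache(max_entries=2)
first = cache.get("a+")
cache.get("b+")
cache.get("a+")      # refreshes "a+", so "b+" is now the oldest entry
cache.get("c+")      # evicts "b+"
print(len(cache))
```

A hard cap like this trades some recompilation for predictable memory use, instead of relying on the garbage collector to clear soft references under pressure.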

Replication Factor to use for system_auth

When using internal security with Cassandra, what replication factor do you use for system_auth?
The older docs seem to suggest it should be N, where N is the number of nodes, while the newer ones suggest we should set it to a number greater than 1. I can understand why it makes sense for it to be higher - if a partition occurs and one section doesn't have a replica, nobody can log in.
However, does it need to be all nodes? What are the downsides of setting it to all nodes?
Let me answer this question by posing another:
If (due to some unforeseen event) all of your nodes went down, except for one, would you still want to be able to log into (and use) that node?
This is why I actually do ensure that my system_auth keyspace replicates to all of my nodes. You can't predict node failure, and in the interests of keeping your application running, it's better safe than sorry.
I don't see any glaring downsides in doing so. The system_auth keyspace isn't very big (mine is 20kb) so it doesn't take up a lot of space. The only problematic scenario I can think of would be if one of the nodes is down and a write operation is made to a column family in system_auth (in which case, I think the write gets rejected, depending on your write consistency). Either way, system_auth isn't a very write-heavy keyspace. So you'll be ok as long as you don't plan on performing user maintenance during a node failure.
Setting the replication factor of system_auth to the number of nodes in the cluster should be ok. At the very least, I would say you should make sure it replicates to all of your data centers.
In case you were still wondering about this part of your question:
The older docs seem to suggest it should be N, where N is the number of nodes, while the newer ones suggest we should set it to a number greater than 1.
I stumbled across this today in the 2.1 documentation Configuring Authentication:
Increase the replication factor for the system_auth keyspace to N (number of nodes).
Just making sure that recommendation was clear.
Addendum 20181108
So I originally answered this back when the largest cluster I had ever managed had 10 nodes. Four years later, after spending three of those managing large (100+) node clusters for a major retailer, my opinions on this have changed somewhat. I can say that I no longer agree with this statement of mine from four years ago.
This is why I actually do ensure that my system_auth keyspace replicates to all of my nodes.
A few times on mid-to-large (20-50 node) clusters, we have deployed system_auth with RFs as high as 8. It works as long as you're not moving nodes around, and assumes that the default cassandra/cassandra user is no longer in play.
The drawbacks were seen on clusters which have a tendency to fluctuate in size. Of course, clusters which change in size usually do so because of high throughput requirements across multiple providers, further complicating things. We noticed that occasionally, the application teams would report auth failures on such clusters. We were able to quickly rectify these situations by running a SELECT COUNT(*) on all system_auth tables, thus forcing a read repair. But the issue would tend to resurface the next time we added/removed several nodes.
Due to issues that can happen with larger clusters which fluctuate in size, we now treat system_auth like we do any other keyspace. That is, we set system_auth's RF to 3 in each DC.
That seems to work really well. It gets you around the problems that come with having too many replicas to manage in a high-throughput, dynamic environment. After all, if RF=3 is good enough for your application's data, it's probably good enough for system_auth.
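For reference, setting an RF of 3 in each DC comes down to a single ALTER KEYSPACE statement with NetworkTopologyStrategy. The small helper below just builds that statement as a string; the helper name and the data center names dc1 and dc2 are placeholders, so substitute the names your own cluster reports.

```python
def build_alter_system_auth(rf_per_dc):
    """Build the CQL statement that sets system_auth's replication per data center.

    `rf_per_dc` maps data center name -> replication factor (names are placeholders here).
    """
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {rf}" for dc, rf in rf_per_dc.items()]
    return "ALTER KEYSPACE system_auth WITH replication = {" + ", ".join(options) + "};"


statement = build_alter_system_auth({"dc1": 3, "dc2": 3})
print(statement)
```

After changing the replication factor you would normally also run a repair on system_auth, so that the new replicas actually receive the existing auth data.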
The reason the recommendation changed is that a quorum query requires responses from over half of the replicas in order to be fulfilled. So if you accidentally left the cassandra user active and system_auth is replicated to all 80 of your nodes, you need 41 responses.
Whilst it's good practice to avoid using the super user like that - you'd be surprised how often it's still out there.
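The arithmetic behind that number is simply "more than half of the replicas". A quick sketch (the function name is mine):

```python
def quorum(replication_factor):
    """Number of replica responses a QUORUM request needs: more than half of the replicas."""
    return replication_factor // 2 + 1


# RF equal to an 80-node cluster versus the usual RF of 3.
for rf in (80, 8, 3):
    print(rf, quorum(rf))
```

So with an RF of 3, a quorum read only ever waits on two replicas, no matter how large the cluster grows.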

Associative cache simulation - Dealing with a Faulty Scheme

While working on simulating a fully associative cache (in MIPS assembly), a couple of questions came to mind based on some information I read online.
According to some notes from the University of Maryland
Finding a slot: At most, one slot should match. If there is more than one slot that matches, then you have a faulty fully-associative cache scheme. You should never have more than one copy of the cache line in any slot of a fully-associative cache. It's hard to maintain multiple copies, and doesn't make sense. The slots could be used for other cache lines.
Does that mean I should search the whole tag list every time in order to check for a second match? After all, if I don't, I will never notice the fault in the cache, yet checking every single time seems quite inefficient.
If I do check, and somehow find a second match (meaning a faulty cache scheme), what should I do then? The best answer would be to fix my implementation, but I'm interested in how to handle this situation during execution should it arise.
If more than one valid slot matches an address, then that means that when a previous search for the same address was executed, either a valid slot that should have matched the address was not used (perhaps because it was not checked in the first place) or more than one invalid slot was used to store the line that wasn't in the cache at all.
Without a doubt, this should be considered a bug.
But if we've just decided not to fix the bug (maybe we'd rather not commit that much hardware to a better implementation) the most obvious option is to pick one of the slots to invalidate. It will then be available for other cache lines.
As for how to pick which one to invalidate, if one of the duplicate lines is clean, invalidate that one in preference to a dirty cache line. If more than one cache line is dirty and they disagree, you have an even bigger bug to fix, but at any rate your cache is out of sync and it probably doesn't matter which you pick.
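In a software simulation, that "pick one to invalidate" rule might look like the sketch below. The Slot fields and the function name are assumptions made for illustration; a MIPS implementation would keep the same bits in parallel arrays.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    valid: bool = False
    dirty: bool = False
    tag: int = 0


def invalidate_duplicates(slots, tag):
    """Leave at most one valid slot for `tag`, keeping a dirty copy if there is one.

    Returns the number of slots that were invalidated.
    """
    matches = [s for s in slots if s.valid and s.tag == tag]
    if len(matches) <= 1:
        return 0
    # Prefer to keep a dirty line; clean duplicates can be dropped for free.
    keeper = next((s for s in matches if s.dirty), matches[0])
    removed = 0
    for slot in matches:
        if slot is not keeper:
            slot.valid = False
            removed += 1
    return removed


slots = [Slot(True, False, 7), Slot(True, True, 7), Slot(True, False, 9)]
print(invalidate_duplicates(slots, 7))
```

If several duplicates are dirty, this sketch simply keeps the first dirty one it finds, in line with the remark that it probably doesn't matter which you pick.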
Edit: here's how I might implement hardware to do this:
First off, it doesn't make a whole lot of sense to start with the assumption of duplicates, rather we'll work around that at the appropriate time later. There are a few possibilities of what must happen when caching a new line.
The line is already in the cache, no action is needed
The line is not in the cache but there are invalid slots available: Place the new line into one of the available slots
The line is not in the cache but there are no invalid slots available. Another valid line must be evicted and the new line takes its place.
Picking an eviction candidate has performance consequences. Clean cache lines can be evicted for free, but if chosen poorly, it can cause another cache miss in the near future. Consider if all but one cache line is dirty. If only the clean cache line is evicted, then many sequential reads alternating between two addresses will cause a cache miss on every read. Cache invalidation is among the two hard problems in Comp Sci (the other being 'naming things') and out of the scope of this exact question.
I would probably implement a search that checks for the correct slot to act on for each of these. Then another block would pick the first from that list and act on it.
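The three cases above can be sketched as a small simulation. Everything here is an assumption for illustration: the class and field names are made up, and least-recently-used is just one possible way to pick an eviction candidate. Write-back of a dirty victim is left out.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    valid: bool = False
    dirty: bool = False
    tag: int = 0
    last_used: int = 0


class FullyAssociativeCache:
    def __init__(self, num_slots):
        self.slots = [Slot() for _ in range(num_slots)]
        self.clock = 0

    def access(self, tag, write=False):
        self.clock += 1
        # Case 1: the line is already in the cache.
        for slot in self.slots:
            if slot.valid and slot.tag == tag:
                slot.dirty = slot.dirty or write
                slot.last_used = self.clock
                return "hit"
        # Case 2: not cached, but an invalid slot is available.
        target = next((s for s in self.slots if not s.valid), None)
        outcome = "filled"
        if target is None:
            # Case 3: no invalid slot, so evict (LRU is an assumed policy;
            # a real cache would write a dirty victim back first).
            target = min(self.slots, key=lambda s: s.last_used)
            outcome = "evicted"
        target.valid, target.dirty = True, write
        target.tag, target.last_used = tag, self.clock
        return outcome


cache = FullyAssociativeCache(2)
print([cache.access(t) for t in (1, 2, 1, 3)])
```

Because the search for a hit runs before any slot is filled, this sequential version can never place the same tag in two slots.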
Now, getting back to the question. What are the conditions under which duplicates could possibly enter the cache. If memory accesses are strictly ordered, and the implementation (as above) is correct, I don't think duplicates are possible at all. And thus there's no need to check for them.
Now let's consider a more implausible case where a single cache is shared across two CPU cores. We're going to just do the simplest thing that could work and duplicate everything except the cache memory itself for each core. Thus the slot-searching hardware is not shared. To support this, an extra bit per slot is used as a mutex. Search hardware cannot use a slot that is locked by the other core. Specifically:
If the address is in the cache, try to lock the slot and return that slot. If the slot is already locked, stall until it is free.
If the address is not in the cache, find an unlocked slot that is invalid or valid but evictable.
In this case we actually can end up in a position where two slots share the same address. If both cores try to write to an address that is not in the cache, they will end up getting different slots, and a duplicate line will occur. First, let's think about what could happen:
Both lines were reads from main memory. They will be the same value and they will both be clean. It is correct to evict either.
Both lines were writes. Both will be dirty, but probably not be equal. This is a race condition that should have been resolved by the application by issuing memory fences or some other memory ordering instructions. We cannot guess which one should be used; if there were no cache, the race condition would persist into RAM. It is correct to evict either.
One line was a read and one was a write. The write is dirty but the read is clean. Once again this race condition would have persisted into RAM if there was no intervening cache, but the reader could have seen a different value. Evicting the clean line is right by RAM, and also has the side effect of always favoring read-then-write ordering.
So now we know what to do about it, but where does this logic belong? First, let's think about what could happen if we don't do anything. A subsequent cache access for the same address on either core could return either line. Even if neither core is issuing writes, reads could keep coming up different, alternating between the two values. This breaks every conceivable idea about memory ordering.
One solution might be to just say that dirty lines belong to one core only: a line is not simply dirty, but dirty and owned by a particular core.
In the case of two concurrent reads, both lines are identical, unlocked and interchangeable. It doesn't matter which line a core gets for subsequent operations.
In the case of concurrent writes, both lines are out of sync, but mutually invisible. Although the race condition that this creates is unfortunate, it still leads to a reasonable memory ordering, as if all of the operations that happen on the discarded line happened before any of the operations on the cleaned line.
If a read and a write happen concurrently, the dirty line is invisible to the reading core. However, the clean line is visible to both cores, and would cause memory ordering to break down for the writer. Future writes could even cause it to lock both (because both would be dirty).
That last case pretty much mandates that dirty lines be preferred to clean ones. This forces at least some extra hardware to look for dirty lines first and clean lines only if no dirty lines were found. So now we have a new concurrent cache implementation:
If the address is in the cache and dirty and owned by the requesting core, use that slot
If the address is in the cache but clean: for reads, just use that slot; for writes, mark the slot as dirty and use that slot
If the address is not in the cache and there are invalid slots, use an invalid slot
If there are no invalid slots, evict a line and use that slot.
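Written as a software model, the lookup rules above could look roughly like this. It is only a sketch: the owner field, the function name, and the eviction choice (prefer a clean line, otherwise the first slot) are my own assumptions, and locking and write-back are left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    valid: bool = False
    dirty: bool = False
    owner: Optional[int] = None   # core that owns the line while it is dirty
    tag: int = 0


def choose_slot(slots, tag, core, write):
    matches = [s for s in slots if s.valid and s.tag == tag]
    # Rule 1: a dirty line owned by the requesting core is preferred.
    for slot in matches:
        if slot.dirty and slot.owner == core:
            return slot
    # Rule 2: otherwise use a clean copy, marking it dirty on a write.
    for slot in matches:
        if not slot.dirty:
            if write:
                slot.dirty, slot.owner = True, core
            return slot
    # Rule 3: treat it as a miss (dirty lines of the other core are ignored)
    # and use an invalid slot if one exists.
    target = next((s for s in slots if not s.valid), None)
    if target is None:
        # Rule 4: evict. Placeholder policy: a clean line if any, else the first slot.
        target = next((s for s in slots if not s.dirty), slots[0])
    target.valid, target.tag = True, tag
    target.dirty, target.owner = write, (core if write else None)
    return target


slots = [Slot(), Slot()]
a = choose_slot(slots, 5, core=0, write=True)
b = choose_slot(slots, 5, core=1, write=True)
print(a is b)
```

Both cores end up with their own dirty copy of the same address, and each core keeps getting its own copy back on later accesses, which is the "mutually invisible duplicates" behavior described above.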
We're getting closer, but there's still a hole in the implementation. What if both cores access the same address, but not concurrently? The simplest thing is probably to just say that dirty lines are really invisible to other cores. In cache but dirty is the same as not being in the cache at all.
Now all we have to think about is actually providing the tool for applications to synchronize. I'd probably do a tool that just explicitly flushes a line if it is dirty. This would just invoke the same hardware that is used during eviction, but marks the line as clean instead of invalid.
To make a long post short, the idea is to deal with the duplicates not by removing them, but by making sure they cannot lead to further memory ordering issues, and leaving the deduplication work to the application or eventual eviction.

What are the advantages of memory-mapped files?

I've been researching memory mapped files for a project and would appreciate any thoughts from people who have either used them before, or decided against using them, and why?
In particular, I am concerned about the following, in order of importance:
concurrency
random access
performance
ease of use
portability
I think the advantage is really that you reduce the amount of data copying required over traditional methods of reading a file.
If your application can use the data "in place" in a memory-mapped file, it can come in without being copied; if you use a system call (e.g. Linux's pread() ) then that typically involves the kernel copying the data from its own buffers into user space. This extra copying not only takes time, but decreases the effectiveness of the CPU's caches by accessing this extra copy of the data.
If the data actually have to be read from the disc (as in physical I/O), then the OS still has to read them in, and a page fault probably isn't any better performance-wise than a system call. But if they don't (i.e. they are already in the OS cache), performance should in theory be much better.
On the downside, there's no asynchronous interface to memory-mapped files - if you attempt to access a page which isn't mapped in, it generates a page fault then makes the thread wait for the I/O.
The obvious disadvantage to memory mapped files is on a 32-bit OS - you can easily run out of address space.
I have used a memory mapped file to implement an 'auto complete' feature while the user is typing. I have well over 1 million product part numbers stored in a single index file. The file has some typical header information but the bulk of the file is a giant array of fixed size records sorted on the key field.
At runtime the file is memory mapped, cast to a C-style struct array, and we do a binary search to find matching part numbers as the user types. Only a few memory pages of the file are actually read from disk -- whichever pages are hit during the binary search.
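Here is a rough Python equivalent of that lookup, to show the shape of it. The record layout (a 12-byte key plus a 4-byte value), the names, and the sample part numbers are all invented for this sketch, and the header mentioned above is left out so that records start at offset 0.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<12sI")   # hypothetical layout: 12-byte key, 4-byte value


def find(mapped, key):
    """Binary search over fixed-size records sorted by key; returns the value or None."""
    padded = key.ljust(12, b"\0")
    lo, hi = 0, len(mapped) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        # Only the pages containing the probed records are touched.
        rec_key, value = RECORD.unpack_from(mapped, mid * RECORD.size)
        if rec_key == padded:
            return value
        if rec_key < padded:
            lo = mid + 1
        else:
            hi = mid
    return None


def demo():
    parts = sorted([(b"AB-100", 1), (b"AB-200", 2), (b"ZX-900", 3)])
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for key, value in parts:
            tmp.write(RECORD.pack(key, value))
        path = tmp.name
    try:
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return find(m, b"AB-200"), find(m, b"NOPE")
    finally:
        os.remove(path)


print(demo())
```

The search code never issues an explicit read; it indexes into the mapping and lets the OS page in whatever it touches.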
Concurrency - I had an implementation problem where it would sometimes memory map the file multiple times in the same process space. This was a problem as I recall because sometimes the system couldn't find a large enough free block of virtual memory to map the file to. The solution was to only map the file once and thunk all calls to it. In retrospect using a full blown Windows service would have been cool.
Random Access - The binary search is certainly random access and lightning fast
Performance - The lookup is extremely fast. As users type a popup window displays a list of matching product part numbers, the list shrinks as they continue to type. There is no noticeable lag while typing.
Memory mapped files can be used to either replace read/write access, or to support concurrent sharing. When you use them for one mechanism, you get the other as well.
Rather than lseeking and writing and reading around in a file, you map it into memory and simply access the bits where you expect them to be.
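As a minimal illustration of "simply access the bits where you expect them to be", this sketch patches a few bytes of a file through a mapping instead of seeking and writing. The helper name and the temporary file are just scaffolding for the example.

```python
import mmap
import os
import tempfile


def patch_in_place(path, offset, data):
    """Overwrite len(data) bytes at `offset` by storing into a mapping of the file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:   # length 0 maps the whole file
            m[offset:offset + len(data)] = data
            m.flush()


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world")
        path = tmp.name
    try:
        patch_in_place(path, 6, b"WORLD")
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(demo())
```

Note that a mapping cannot change the size of the file this way; the slice assignment has to replace exactly as many bytes as it covers.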
This can be very handy, and depending on the virtual memory interface can improve performance. The performance improvement can occur because the operating system now gets to manage this former "file I/O" along with all your other programmatic memory access, and can (in theory) leverage the paging algorithms and so forth that it is already using to support virtual memory for the rest of your program. It does, however, depend on the quality of your underlying virtual memory system. Anecdotes I have heard say that the Solaris and *BSD virtual memory systems may show better performance improvements than the VM system of Linux--but I have no empirical data to back this up. YMMV.
Concurrency comes into the picture when you consider the possibility of multiple processes using the same "file" through mapped memory. In the read/write model, if two processes wrote to the same area of the file, you could be pretty much assured that one process's data would arrive in the file, overwriting the other process's data. You'd get one, or the other, but not some weird intermingling. I have to admit I am not sure whether this is behavior mandated by any standard, but it is something you could pretty much rely on. (It's actually a good follow-up question!)
In the mapped world, in contrast, imagine two processes both "writing". They do so by doing "memory stores", which result in the O/S paging the data out to disk--eventually. But in the meantime, overlapping writes can be expected to occur.
Here's an example. Say I have two processes both writing 8 bytes at offset 1024. Process 1 is writing '11111111' and process 2 is writing '22222222'. If they use file I/O, then you can imagine, deep down in the O/S, there is a buffer full of 1s, and a buffer full of 2s, both headed to the same place on disk. One of them is going to get there first, and the other one second. In this case, the second one wins. However, if I am using the memory-mapped file approach, process 1 is going to do a memory store of 4 bytes, followed by another memory store of 4 bytes (let's assume that's the maximum memory store size). Process 2 will be doing the same thing. Based on when the processes run, you can expect to see any of the following:
11111111
22222222
11112222
22221111
The solution to this is to use explicit mutual exclusion--which is probably a good idea in any event. You were sort of relying on the O/S to do "the right thing" in the read/write file I/O case, anyway.
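If you want to convince yourself where those four outcomes come from, the following sketch enumerates every way the two pairs of 4-byte stores can interleave and collects the resulting buffers. It is a deterministic model of the race, not a real multi-process test, and the helper names are mine.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each sequence's own order."""
    total = len(a) + len(b)
    for positions in combinations(range(total), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in positions else next(ib) for i in range(total)]


def possible_results():
    # Each process does two 4-byte stores: low half first, then high half.
    p1 = [(0, b"1111"), (4, b"1111")]
    p2 = [(0, b"2222"), (4, b"2222")]
    results = set()
    for order in interleavings(p1, p2):
        buf = bytearray(8)
        for offset, data in order:
            buf[offset:offset + 4] = data
        results.add(bytes(buf).decode())
    return sorted(results)


print(possible_results())
```

With a lock held around both stores, only the two orderings where one process finishes before the other starts remain, which leaves just the two unmixed results.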
The classic mutual exclusion primitive is the mutex. For memory mapped files, I'd suggest you look at a mutex that lives in the mapped memory itself, available using (e.g.) pthread_mutex_init() with the process-shared attribute (PTHREAD_PROCESS_SHARED) set.
Edit with one gotcha: When you are using mapped files, there is a temptation to embed pointers to the data in the file, in the file itself (think linked list stored in the mapped file). You don't want to do that, as the file may be mapped at different absolute addresses at different times, or in different processes. Instead, use offsets within the mapped file.
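A small sketch of the offsets idea: a linked list whose "next" fields are byte offsets from the start of the mapping rather than addresses. The node layout and function names are made up for this example, and it uses an anonymous mapping so it runs without a file.

```python
import mmap
import struct

NODE = struct.Struct("<iI")   # hypothetical node: next offset (-1 = end of list), value
END = -1


def write_list(buf, values):
    """Store `values` as a linked list of nodes laid out back to back from offset 0."""
    if not values:
        raise ValueError("this sketch needs at least one value")
    for i, value in enumerate(values):
        offset = i * NODE.size
        next_offset = offset + NODE.size if i + 1 < len(values) else END
        NODE.pack_into(buf, offset, next_offset, value)


def read_list(buf, head=0):
    """Follow the offset links starting at `head` and collect the values."""
    values, offset = [], head
    while offset != END:
        offset, value = NODE.unpack_from(buf, offset)
        values.append(value)
    return values


def demo():
    with mmap.mmap(-1, 4096) as m:   # anonymous mapping stands in for a mapped file
        write_list(m, [10, 20, 30])
        return read_list(m)


print(demo())
```

Because every link is relative to the start of the mapping, the same bytes stay valid no matter which address the file is mapped at in any given process.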
Concurrency would be an issue.
Random access is easier
Performance is good to great.
Ease of use. Not as good.
Portability - not so hot.
I've used them on a Sun system a long time ago, and those are my thoughts.