I am observing that the cbbackupmgr utility pauses while doing a restore for a bucket. This happens when the resident ratio becomes 0. How and why does the resident ratio become 0 here, although I can see there is data in memory?
Note: the bucket data is about 3 GB and the bucket quota is set to 20% of this, so around 600 MB. The bucket quota is set this way so that the resident ratio doesn't exceed 20%.
The question is:
A wide bus configuration has the following parameters:
Number of cycles to send the address
Number of cycles for a bus transfer = 2 cycles
Memory Access = 30 cycles
How many cycles are needed to transfer a block of 32 bytes?
So, since it's a wide bus configuration, I assumed that the bus transfer would be done in a single iteration, and the same for the memory access.
Which means that I got 30 + 2 cycles = 32.
However, I can't make sense of the size of the bus and its impact. I can't understand how I can calculate the remaining number of cycles from it.
I'm using Hardhat with gas-report, but I'm not able to understand the following results:
Optimizer enabled: false
Runs: 200
Block limit: 30000000 gas
% of limit
Here I have marked the fields with a red square:
[screenshot of the Hardhat gas report table]
Optimizer (whether it's enabled or disabled) and Runs (the target number of contract runs for which the optimizer should optimize the contract bytecode) are options of the Solidity compiler. When you compile the contract with the optimizer, it can decrease either the total bytecode size or the amount of gas needed to execute some functions. (docs)
Block limit states the amount of gas units that can fit into one block. Different networks might have different values, some have dynamically adjusted limits, plus you can usually set your own limit if you're using an emulator or a private network. (docs)
% of limit states what portion of the total block limit your contract deployment consumed. Example from your table: deployment of HashContract cost 611k gas units, which is approx. 2% of the 30M block limit. If the number exceeded 100%, the transaction would never be included in a block - at least not in a block with the same or a smaller limit. Also, if the transaction has a low gasPrice and takes up a high % of the total block limit, some miners/validators might not be able to fit the transaction into a block (as transactions are usually ordered from highest gasPrice to lowest), so it might take longer to be included in a block.
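For what it's worth, the "% of limit" column is just the gas used divided by the block limit. A minimal sketch of that arithmetic, reusing the 611k / 30M figures from the table above (treat them as illustrative):

#include <cstdio>

int main() {
    // Figures from the table above: ~611,000 gas to deploy HashContract,
    // 30,000,000 gas block limit.
    const double gasUsed    = 611000.0;
    const double blockLimit = 30000000.0;
    printf("%% of limit = %.1f%%\n", gasUsed / blockLimit * 100.0);  // ~2.0%
    return 0;
}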
The Nvidia website mentions a few causes of low achieved occupancy, among them uneven distribution of the workload among blocks, which results in blocks hoarding shared memory resources and not releasing them until the block is finished. The suggestion is to decrease the block size, thus increasing the overall number of blocks (given that we keep the total number of threads constant, of course).
A good explanation of that was also given here on Stack Overflow.
Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) simply be to set the block size as small as possible (equal to the size of a warp, say 32 threads)? That is, unless you need a larger number of threads to communicate through shared memory, I assume.
Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) simply be to set the block size as small as possible (equal to the size of a warp, say 32 threads)?
No.
As shown in the documentation here, there is a limit on the number of blocks per multiprocessor, which would leave you with a maximum theoretical occupancy of 25% or 50% when using 32-thread blocks, depending on what hardware you run the kernel on.
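As a rough sketch of how you can check this yourself, the runtime's occupancy API reports how many resident blocks of a given size fit on one SM; dummyKernel below is just a stand-in for your real kernel (occupancy also depends on the kernel's register and shared memory usage):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own, since register and shared
// memory usage change the occupancy figures.
__global__ void dummyKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)i;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSizes[] = {32, 64, 128, 256};
    for (int i = 0; i < 4; ++i) {
        int blockSize = blockSizes[i];
        int blocksPerSM = 0;
        // Ask the runtime how many blocks of this size can be resident on one SM.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel, blockSize, 0);
        float occupancy = 100.0f * blocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor;
        printf("block size %4d: %2d blocks/SM, theoretical occupancy %.0f%%\n",
               blockSize, blocksPerSM, occupancy);
    }
    return 0;
}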
Usually it is a good approach to use blocks that are as small as possible but big enough to saturate the device (64 or 128 threads per block, depending on the device). It is not always possible, since you might want to synchronize threads or communicate via shared memory.
Having a large number of small blocks allows the GPU to do a kind of "autobalancing" and keep all SMs running.
The same applies to the CPU - if you have 5 independent tasks and each takes 4 seconds to finish, but you have only 4 cores, then it will end after 8 seconds (during the first 4 seconds, 4 cores are running the first 4 tasks, and then 1 core is running the last task while 3 cores are idling).
If you are able to divide the whole job into 20 tasks that take 1 second each, then the whole job will be done in 5 seconds. So having a lot of small tasks helps to utilize the hardware.
In the case of a GPU you can have a large number of active blocks (on a Titan X it is 24 SMs x 32 active blocks = 768 blocks), and it would be good to use this capacity.
Anyway, it is not always true that you need to fully saturate the device. On many tasks I can see that using 32 threads per block (so having 50% possible occupancy) gives the same performance as using 64 threads per block.
In the end it is all a matter of doing some benchmarks and choosing whatever is best for you in a given case with given hardware.
I have a Couchbase (v 2.0.1) cluster with the following specifications:
5 Nodes
1 Bucket
16 GB RAM per node (80 GB total)
200 GB disk per node (1 TB total)
Currently I have 201,000,000 documents in this bucket and only 200 GB of disk in use.
I'm getting the following warning every minute for every node:
Metadata overhead warning. Over 51% of RAM allocated to bucket "my-bucket" on node "my-node" is taken up by keys and metadata.
The Couchbase documentation states the following:
Indicates that a bucket is now using more than 50% of the allocated RAM for storing metadata and keys, reducing the amount of RAM available for data values.
I understand that this could be a helpful indicator that I may need to add nodes to my cluster but I think this should not be necessary given the amount of resources available to the bucket.
General Bucket Analytics:
How could I know what is generating so much metadata?
Is there any way to configure the tolerance percentage?
Every document has metadata and a key stored in memory. The metadata is 56 bytes. Add that to your average key size and multiply the result by your document count to arrive at the total bytes for metadata and keys in memory. So the RAM required is affected by the document count, your key size, and the number of copies (replica count + 1). You can find details at http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#memory-quota. The specific formula there is:
(documents_num) * (metadata_per_document + ID_size) * (no_of_copies)
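As a rough worked example of that formula for the cluster in the question - the 40-byte average key size and single replica below are assumptions, not values taken from the question:

#include <cstdio>
#include <cstdint>

int main() {
    const uint64_t documents      = 201000000ULL; // document count from the question
    const uint64_t metadataPerDoc = 56;           // bytes of metadata per document
    const uint64_t avgKeySize     = 40;           // bytes, assumed average key size
    const uint64_t copies         = 2;            // 1 active + 1 replica, assumed

    uint64_t bytes = documents * (metadataPerDoc + avgKeySize) * copies;
    // ~38.6 billion bytes, roughly 36 GB of RAM for keys and metadata alone.
    printf("metadata + keys in RAM: %.1f GB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

With those assumptions, keys and metadata alone would take tens of gigabytes of the bucket's quota, which makes the 50% overhead warning plausible.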
You can get details about the user data and metadata being used by your cluster from the console (or via the REST or command line interface). Look at the 'VBUCKET RESOURCES' section. The specific values of interest are 'user data in RAM' and 'metadata in RAM'. From your screenshot, you are definitely running up against your memory capacity. You are over the low water mark, so the system will eject inactive replica documents from memory. If you cross the high water mark, the system will then start ejecting active documents from memory until it reaches the low water mark. Any requests for ejected documents will then require a background disk fetch. From your screenshot, you have less than 5% of your active documents in memory already.
It is possible to change the metadata overhead warning threshold in the 2.5.1 release. There is a script you can use located at https://gist.github.com/fprimex/11368614. Or you can simply take the curl command from the script and plug in the right values for your cluster. As far as I know, this will not work prior to 2.5.1.
Please keep in mind that while these alerts (max overhead and max disk usage) are now tunable, they are there for a reason. Hitting either of these alerts (especially in production) at the default values is a major cause for concern and should be dealt with as soon as possible by increasing RAM and/or disk on every node, or adding nodes. The values are tunable for special cases. Even in development/testing scenarios, your nodes' performance may be significantly impaired if you are hitting these alerts. For example, don't draw conclusions about benchmark results if your nodes' RAM is over 50% consumed by metadata.
I used cudaMalloc to allocate an array of 100 integers, i.e. in total I have
int_total_bytes=100*sizeof(int),
and to allocate an array of 1000 doubles, i.e. in total I have
db_total_bytes=1000*sizeof(double),...
Can I be sure that the total global memory used on the GPU would be
int_total_bytes+db_total_bytes?
thanks!
Several situations can make the actual size of memory allocated larger than the calculated sizes due to padding added to achieve optimum address alignment or due to minimum block sizes.
For the two examples you give, the data sizes are compatible with natural alignment sizes and boundaries so you probably won't see much difference between calculated and actual memory used. There may still be some variation, though, if cudaMalloc uses a suballocator - if it allocates a large block from the OS (or device), then subdivides that large block into smaller blocks to fill cudaMalloc() requests.
If a suballocator is involved, then the OS will show the actual memory use as considerably larger than your calculated use, but actual use will remain stable even as your app makes multiple small allocations (which can be filled from the already allocated large block).
Similarly, the hardware typically has a minimum allocation size which is usually the same as the memory page size. If the smallest chunk of memory that can be allocated from hardware is, say, 64K, then when you ask for 3k you've got 61K that's allocated but not being used. This is where a suballocator would be useful, to make sure you use as much as you can of the memory blocks you allocate.
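A minimal sketch of one way to observe this on your own device: query the free memory reported by the driver before and after the two allocations from the question and compare the difference with the calculated sizes (variable names reuse the ones above):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);   // free device memory before allocating

    int *d_ints = NULL;
    double *d_doubles = NULL;
    size_t int_total_bytes = 100 * sizeof(int);      // 400 bytes
    size_t db_total_bytes  = 1000 * sizeof(double);  // 8000 bytes
    cudaMalloc((void**)&d_ints, int_total_bytes);
    cudaMalloc((void**)&d_doubles, db_total_bytes);

    cudaMemGetInfo(&freeAfter, &total);    // free device memory after allocating

    printf("calculated:     %zu bytes\n", int_total_bytes + db_total_bytes);
    printf("driver reports: %zu bytes consumed\n", freeBefore - freeAfter);
    // The reported figure is usually larger because of alignment padding and
    // the allocator's minimum block granularity described above.

    cudaFree(d_ints);
    cudaFree(d_doubles);
    return 0;
}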
In addition to what dthorpe said, you can check the GPU memory usage of a process with the nvidia-smi command.