What is requestBufferSize in Couchbase 2.0? The docs say it is a circular buffer for I/O, but is it the number of key-value pairs or the size of the key-value pairs? If it is a size, is it in KB or MB?
I think the size is in bytes. In several discussions and bug reports I've seen it described as the "default 16K buffer".
Now looking at the description:
Default: 16384. The size of the request ring buffer where all request initially are
stored and then picked up to be pushed onto the I/O threads. Tuning
this to a lower value will more quickly lead to BackpressureExceptions
during overload or failure scenarios. Setting it to a higher value
means backpressure will take longer to occur, but more requests will
potentially be queued up and more heap space is used.
You can see that it's referred to as the "default 16K buffer".
Knowing hardware limits is useful for understanding if your code is performing optimally. The global device memory bandwidth limits how many bytes you can read per second, and you can approach this limit if the chunks you are reading are large enough.
But suppose you are reading, in parallel, N chunks of D bytes each, scattered in random locations in global device memory. Is there a useful formula limiting how much of the bandwidth you'd be able to achieve then?
Let's assume:
we are talking about accesses from device code
a chunk of D bytes means D contiguous bytes
when reading a chunk, the read operation is fully coalesced - those bytes are read 4 bytes per thread, by as many adjacent threads in the block as D/4 implies.
the temporal and spatial characteristics are such that no two chunks are within 32 bytes of each other - either they are all gapped by that much, or else the distribution of loads in time is such that the L2 doesn't provide any benefit. Pretty much saying the L2 hitrate is zero. This seems evident in your statement "global device memory bandwidth" - if the L2 hitrate is not zero, you're not measuring (purely) global device memory bandwidth
we are talking about a relatively recent GPU architecture, say Pascal or newer, or else for an older architecture the L1 is disabled for global loads. Pretty much saying the L1 hitrate is zero.
the overall footprint is not so large as to thrash the TLB
the starting address of each chunk is aligned to a 32-byte boundary (&)
your GPU is sufficiently saturated with warps and blocks to make full use of all resources (e.g. all SMs, all SM partitions, etc.)
the actual chunk access pattern (distribution of addresses) does not result in partition camping or some other hard-to-predict effect
In that case, you can simply round the chunk size D up to the next multiple of 32, and do a calculation based on that. What does that mean?
The predicted bandwidth (B) is:
Bd = the device memory bandwidth of your GPU as indicated by deviceQuery
B = Bd/(((D+31)/32)*32)
The resulting units are chunks/sec (bytes/sec divided by bytes/chunk). The inner division, (D+31)/32, is "integer division", i.e. it drops any fractional part.
(&) In the case where we don't want this assumption, the worst case is to add an additional 32-byte segment per chunk. The formula then becomes:
B = Bd/((((D+31)/32)+1)*32)
Note that this condition cannot apply when the chunk size is less than 34 bytes.
All I am really doing here is calculating the number of 32-byte DRAM transactions that would be generated by a stream of such requests, and using that to "derate" the observed peak (100% coalesced/100% utilized) case.
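As a minimal sketch of the two formulas above, the following Python function counts 32-byte segments per chunk and derates the peak bandwidth accordingly. The function name, the `aligned` flag, and the 900e9 bytes/s figure are assumptions chosen for illustration; substitute the bandwidth reported for your own GPU.

```python
def chunks_per_second(bd_bytes_per_s, d_bytes, aligned=True):
    """Predicted chunk throughput under the assumptions listed above.

    bd_bytes_per_s: peak device memory bandwidth (Bd), in bytes/sec
    d_bytes:        chunk size (D), in bytes
    aligned:        True if every chunk starts on a 32-byte boundary
    """
    # Number of 32-byte DRAM transactions per chunk (integer division).
    segments = (d_bytes + 31) // 32
    if not aligned:
        # Worst case for unaligned chunks: one additional segment.
        segments += 1
    return bd_bytes_per_s / (segments * 32)


# Hypothetical peak bandwidth, NOT a measured value for any specific GPU.
BD = 900e9

for d in (4, 32, 48, 128):
    rate = chunks_per_second(BD, d)
    # Fraction of the transferred bytes that the kernel actually asked for.
    useful = rate * d / BD
    print(f"{d:>4} B chunks: {rate:.3e} chunks/s, useful fraction {useful:.2f}")
```

The "useful fraction" column is not part of the original formula; it is only there to show how much of each 32-byte transaction is wasted when D is not a multiple of 32.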
Under @RobertCrovella's assumptions, and assuming the chunk sizes are multiples of 32 bytes and chunks are 32-byte aligned, you will get the same bandwidth as for a single chunk - as Robert's formula tells you. So, no benefit and no detriment.
But ensuring these assumptions hold is often not trivial (even merely ensuring coalesced memory reads).
When you want to train a net you will get a log information like:
Memory required for data: 493376512
How do you interpret the number? Is it in bytes, bits?
To answer your direct question, the memory usage for the layer is given in bytes, not bits.
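To put the number from the log into more familiar units, here is a small sketch that converts it to MiB and to an element count. The helper name is made up for this example, and the 4 bytes per element is an assumption that the net uses single-precision floats.

```python
def describe_bytes(n_bytes, bytes_per_element=4):
    """Convert a byte count into MiB and an element count.

    bytes_per_element=4 assumes float32 blobs (an assumption, not
    something the log line itself states).
    """
    return {
        "MiB": n_bytes / 2**20,
        "elements": n_bytes // bytes_per_element,
    }


info = describe_bytes(493376512)
print(f"{info['MiB']:.2f} MiB, {info['elements']} elements")
```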
Overall, the memory reported is all of the layer's required memory that can be computed at initialization. The parameter and buffer space is allocated as needed, so it's not useful to report those in an aggregate sum -- some of them might take less total memory due to serialized re-use.
As the other answer says, the most effective way to determine the max memory usage is to run a forward pass and grab its high-water mark.
See shelhamer's comment:
https://github.com/BVLC/caffe/issues/2387#issuecomment-97910200
It's the memory required for all the top blobs or "data" in the sense
of layer outputs. It excludes the diffs, the parameters, and any
intermediate blobs within layers. @jeffdonahue do you remember the
motivation for this number?
And longjon:
Since layers are individually responsible for allocating parameters
and buffers, there's no way in general to know the total amount of
memory required except by running a forward pass.
The MySQL manual states that a field with the type date takes up three bytes, but if the date is 0000-00-00, does it still take up those bytes? If so, is there an advised method to reduce storage, such as setting the field to NULL?
InnoDB (which you should be using; do not use MyISAM) uses zero bytes for a field that is NULL, but the full number of bytes for a zero value date.
Using NULL may allow InnoDB to store a few more rows per page. I say may because you might have a bunch of other non-null fields per row, so the ratio of savings will be small. If you can do this, InnoDB can fit more rows in the same size buffer pool, thus incur less frequent I/O to read pages (because they stay in the buffer pool), thus get more performance.
Those are a lot of conditions and caveats. The net benefit to performance is likely to be very modest.
I suggest this should not be the focus of your optimization efforts. You'll get better bang for the buck by concentrating on:
Analyzing queries so you can choose the right indexes.
Using Memcached for caching on a case-by-case basis in your app.
Designing your application architecture for better scaling.
Upgrading your system RAM and buffer pool size until it holds more of your database pages.
I have a Couchbase (v 2.0.1) cluster with the following specifications:
5 Nodes
1 Bucket
16 GB RAM per node (80 GB total)
200 GB disk per node (1 TB total)
Currently I have 201,000,000 documents in this bucket and only 200 GB of disk in use.
I'm getting the following warning every minute for every node:
Metadata overhead warning. Over 51% of RAM allocated to bucket "my-bucket" on node "my-node" is taken up by keys and metadata.
The Couchbase documentation states the following:
Indicates that a bucket is now using more than 50% of the allocated
RAM for storing metadata and keys, reducing the amount of RAM
available for data values.
I understand that this could be a helpful indicator that I may need to add nodes to my cluster but I think this should not be necessary given the amount of resources available to the bucket.
General Bucket Analytics:
How could I know what is generating so much metadata?
Is there any way to configure the tolerance percentage?
Every document has metadata and a key stored in memory. The metadata is 56 bytes. Add that to your average key size and multiply the result by your document count to arrive at the total bytes for metadata and keys in memory. So the RAM required is affected by the doc count, your key size, and the number of copies (replica count + 1). You can find details at http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#memory-quota. The specific formula there is:
(documents_num) * (metadata_per_document + ID_size) * (no_of_copies)
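As a rough illustration of that formula, the sketch below plugs in the document count from the question. The 40-byte average key size and the single replica (two copies) are assumptions made up for this example, since the question does not state them; the function name is also hypothetical.

```python
def metadata_ram_bytes(documents_num, id_size, no_of_copies,
                       metadata_per_document=56):
    """(documents_num) * (metadata_per_document + ID_size) * (no_of_copies)"""
    return documents_num * (metadata_per_document + id_size) * no_of_copies


docs = 201_000_000   # document count from the question
key_size = 40        # ASSUMED average key size in bytes
copies = 2           # ASSUMED: 1 replica, so replica count + 1 = 2

total = metadata_ram_bytes(docs, key_size, copies)
print(f"{total} bytes, about {total / 1e9:.1f} GB across the cluster")
```

Comparing the result with the RAM quota allocated to the bucket (not the total RAM installed in the nodes) gives a rough idea of how much of the quota keys and metadata take up.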
You can get details about the user data and metadata being used by your cluster from the console (or via REST or the command-line interface). Look at the 'VBUCKET RESOURCES' section. The specific values of interest are 'user data in RAM' and 'metadata in RAM'. From your screenshot, you are definitely running up against your memory capacity. You are over the low water mark, so the system will eject inactive replica documents from memory. If you cross the high water mark, the system will then start ejecting active documents from memory until it reaches the low water mark. Any requests for ejected documents will then require a background disk fetch. From your screenshot, you have less than 5% of your active documents in memory already.
It is possible to change the metadata warning threshold in the 2.5.1 release. There is a script you can use located at https://gist.github.com/fprimex/11368614. Or you can simply leverage the curl command from the script and plug in the right values for your cluster. As far as I know, this will not work prior to 2.5.1.
Please keep in mind that while these alerts (max overhead and max disk usage) are now tunable, they are there for a reason. Hitting either of these alerts (especially in production) at the default values is a major cause for concern and should be dealt with as soon as possible by increasing RAM and/or disk on every node, or adding nodes. The values are tunable for special cases. Even in development/testing scenarios, your nodes' performance may be significantly impaired if you are hitting these alerts. For example, don't draw conclusions about benchmark results if your nodes' RAM is over 50% consumed by metadata.
I have a data array that is per-block.
I have N blocks inside a CUDA grid and a constant array of data "block_data[]" with size N.
So all threads in a given block 'X' access block_data[X] just once, and do something with that value.
My question is: does this broadcast scheme work efficiently?
If not, what approach should I take?
Edit after comments: my only problem with constant memory is its limited size, since I could have more than 64K blocks. That would mean more than 64 KB.
regards
If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming each element of block_data is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one byte of info you need; if other warps in the block request the same data then they should get it from L1. The efficiency of the load is 1/128 but you should only pay that once for the block.
If the operation is not cached in L1 (e.g. you pass "-dlcm=cg" to the assembler) then you will fetch 32 bytes. The efficiency is 1/32 but you pay that once for each warp.
Once the data is loaded, it is broadcast to all threads in the warp.
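To compare the two cases, here is a back-of-the-envelope sketch that simply counts bytes fetched per block according to the model described above (one 128-byte line per block when cached in L1, one 32-byte segment per warp otherwise). The names are made up for illustration, and the model ignores L2 hits and any other hardware detail, so treat the numbers as a rough guide only.

```python
L1_LINE_BYTES = 128    # fetch size when the load is cached in L1
L2_SEGMENT_BYTES = 32  # fetch size when the load bypasses L1
WARP_SIZE = 32


def bytes_fetched_per_block(threads_per_block, cached_in_l1):
    """Bytes fetched to deliver one value to every thread of one block."""
    if cached_in_l1:
        # One line is fetched; later warps in the block are assumed to hit L1.
        return L1_LINE_BYTES
    # One 32-byte segment per warp (ceiling division for partial warps).
    warps = -(-threads_per_block // WARP_SIZE)
    return warps * L2_SEGMENT_BYTES


for threads in (64, 128, 256):
    cached = bytes_fetched_per_block(threads, cached_in_l1=True)
    uncached = bytes_fetched_per_block(threads, cached_in_l1=False)
    print(f"{threads} threads/block: L1-cached {cached} B, uncached {uncached} B")
```

Under this simple model the two paths break even at four warps per block; either way the cost is a few hundred bytes per block at most, which is consistent with the earlier remark that the impact is probably quite small.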
An alternative would be to mark the data as const __restrict__ which indicates to the compiler that the data is a) read-only and b) not aliased by any other pointer. Since the compiler can detect that the access is uniform then it can optimise the access to use one of the read-only caches (e.g. constant cache or, on compute capability >=3.5, read-only data cache aka texture cache).
If you want to change the values in the block_data[N] array, it is better to use shared memory (__shared__). If you are not changing the values of block_data[N], use __constant__ or rely on caching. By using the L2 cache, you can get 1536 KB of memory (Kepler).