I have a Couchbase (v 2.0.1) cluster with the following specifications:
5 Nodes
1 Bucket
16 GB RAM per node (80 GB total)
200 GB disk per node (1 TB total)
Currently I have 201,000,000 documents in this bucket and only 200 GB of disk in use.
I'm getting the following warning every minute for every node:
Metadata overhead warning. Over 51% of RAM allocated to bucket "my-bucket" on node "my-node" is taken up by keys and metadata.
The Couchbase documentation states the following:
Indicates that a bucket is now using more than 50% of the allocated
RAM for storing metadata and keys, reducing the amount of RAM
available for data values.
I understand that this could be a helpful indicator that I may need to add nodes to my cluster, but I don't think that should be necessary given the amount of resources available to the bucket.
General Bucket Analytics:
How could I know what is generating so much metadata?
Is there any way to configure the tolerance percentage?
Every document has metadata and a key stored in memory. The metadata is 56 bytes. Add that to your average key size and multiply the result by your document count to arrive at the total bytes for metadata and keys in memory. So the RAM required is driven by the document count, your key size, and the number of copies (replica count + 1). You can find details at http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#memory-quota. The specific formula there is:
(documents_num) * (metadata_per_document + ID_size) * (no_of_copies)
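As a rough illustration only (the 30-byte average key size and single replica below are assumptions, not values taken from your cluster), 201 million documents work out to roughly 35 GB of RAM just for keys and metadata:

#include <stdio.h>

/* Back-of-the-envelope version of the formula above.
   The key size and replica count are assumed purely for illustration. */
int main(void) {
    double documents_num    = 201000000.0;
    double metadata_per_doc = 56.0;   /* bytes of metadata per document */
    double id_size          = 30.0;   /* assumed average key size in bytes */
    double no_of_copies     = 2.0;    /* 1 active + 1 replica (assumed) */

    double bytes = documents_num * (metadata_per_doc + id_size) * no_of_copies;
    printf("Keys + metadata in RAM: ~%.1f GB\n", bytes / 1e9);   /* ~34.6 GB */
    return 0;
}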
You can get details about the user data and metadata being used by your cluster from the console (or via the REST or command line interface). Look at the 'VBUCKET RESOURCES' section. The specific values of interest are 'user data in RAM' and 'metadata in RAM'. From your screenshot, you are definitely running up against your memory capacity. You are over the low water mark, so the system will eject inactive replica documents from memory. If you cross the high water mark, the system will then start ejecting active documents from memory until it reaches the low water mark. Any requests for ejected documents will then require a background disk fetch. From your screenshot, you have less than 5% of your active documents in memory already.
It is possible to change the metadata warning threshold in the 2.5.1 release. There is a script you can use, located at https://gist.github.com/fprimex/11368614. Or you can simply take the curl command from the script and plug in the right values for your cluster. As far as I know, this will not work prior to 2.5.1.
Please keep in mind that while these alerts (max overhead and max disk usage) are now tunable, they are there for a reason. Hitting either of these alerts (especially in production) at the default values is a major cause for concern and should be dealt with as soon as possible by increasing RAM and/or disk on every node, or adding nodes. The values are tunable for special cases. Even in development/testing scenarios, your nodes' performance may be significantly impaired if you are hitting these alerts. For example, don't draw conclusions about benchmark results if your nodes' RAM is over 50% consumed by metadata.
So I want to understand how DBMS implementations work.
To give an example:
MySQL implements each table with its own pages, which are 16 KB,
so each table is a file whose size is a multiple of 16 KB, depending on how large it is and therefore how many pages it needs.
Now I read somewhere that these pages don't get fragmented in the disk image or the memory image, so my question is, HOW?
How do DBMS developers tell the operating system, "hey, I just added 16 KB of data (a page) to this file, but make sure this page doesn't get fragmented"?
Is it because the memory image doesn't actually show how the bytes are really stored on disk, and it is only logical?
Or is it because these DBMSes somehow ask the OS to make sure these 16 KB chunks do not get fragmented?
And how do you do this in C?
50 years ago, your question was a hot topic in Computer Science and Engineering. But not today.
Virtually every hard drive has an allocation unit of 512 bytes. CDs have an AU of 2 KB. Certain SSDs, when tuned for MySQL, have an AU of 16 KB.
There are many different "filesystems". Windows has (at least) FAT-32 and NTFS. *nix has lots. Each FS prides itself on doing a better job at something. But free-space management fights with allocation unit size. Remember the hassle DOS had with its FAT-16 while disks were getting bigger and bigger? The "16" in the name refers to the disk having up to 2^16 blocks, which forced a 2 GB disk drive to have an allocation unit of 32 KB! The typical system had a lot of little files; literally half the disk was probably wasted!
I'm talking about "Allocation Units" because that is essentially the only way to prevent the OS from thinking about scattering the blocks around the drive.
Let's look at your question from a marketing point of view. If fragmentation is such a big deal, then
New, better, filesystems would come along to solve the problem -- though not necessarily in the simplistic way you mentioned.
Operating systems have known about the problem, so they have ways of "trying" to allocate in chunks. But they are always willing to give you little pieces when necessary.
MySQL's InnoDB (circa 2000) went to a lot of effort to allocate in "extents" of 1 MB in hopes of getting contiguously allocated disk space. But when that failed, nothing would crash.
Software would bypass the problem, such as by using "raw drive access". But notice how that is not at the forefront of "how to optimize your database"? If it is even available, it is buried in the "Oh, by the way" chapter.
A few decades ago, there were some OSs that would let you pre-allocate a file that was "contiguous". I have not heard of such recently.
Enterprise systems solved the issue by using hardware RAID controllers with Battery Backed Write Cache. Not only would the scatter-gather be hidden from the user, but writes became 'instantaneous' because of the crash-safe cache.
SSDs don't have any seek time (unlike HDDs), so it really does not matter if a block is chopped up. Sure, there is some code to deal with it, but that is really insignificant compared to the transfer, checksum, mutex, system call, etc, etc, times.
I have a Rule of Thumb: If a potential optimization does not look like it will help by 10%, I drop it and move on to something else. I suggest you move on.
How to do this in C:

#include <fcntl.h>
#include <unistd.h>

/* Append one 16 KB page to the end of the database file. */
int add16k(void *My16kDataChunk) {
    int fd = open("My.DataBase", O_WRONLY | O_APPEND);
    if (fd != -1) {
        write(fd, My16kDataChunk, 16 * 1024);
        close(fd);
    }
    return fd != -1;
}
But for a database, you might want to cache your open file descriptors, be able to write at arbitrary offsets, ensure that the data is genuinely recorded(*), etc. Most importantly, you want to ensure that multiple requests do not interfere with each other. In reverse order, you need: synchronization, fdatasync, and pwrite; a sketch follows.
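Here is a minimal sketch putting those three pieces together (the page size, locking scheme, and function name are assumptions for illustration, not a prescribed design):

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define PAGE_SIZE (16 * 1024)

static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;   /* synchronization */

/* Write page `page_no` at its fixed offset and force it to stable storage.
   Returns 1 on success, 0 on failure. */
int write_page(int fd, long page_no, const void *page) {
    int ok = 0;
    pthread_mutex_lock(&db_lock);
    if (pwrite(fd, page, PAGE_SIZE, (off_t)page_no * PAGE_SIZE) == PAGE_SIZE   /* arbitrary offset */
        && fdatasync(fd) == 0)                                                 /* genuinely recorded */
        ok = 1;
    pthread_mutex_unlock(&db_lock);
    return ok;
}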
memory image doesn’t show ...
It is because the memory image is identical to what is on disk. Two subsystems, the VM and the file system, co-operate to achieve this. If I write 16 KB and the file system has to collect 4 discontiguous 4 KB sectors to store it, it arranges my read() and write() calls to be oblivious to this layout: it delivers the data to and from virtually contiguous areas. Similarly, if a 16 KB buffer requires 4 discontiguous 4 KB physical memory pages, the VM system arranges the page mappings to construct a contiguous virtual range.
That said, some file systems support mechanisms to pre-allocate physically contiguous regions of the disk [although with volume management, SANs, and virtualization this is mostly pretend], so that the file system can achieve performance goals at the expense of portability.
(*) - This seems like a fairly simple idea; I can just call fsync(), fdatasync(), sync(), or some such. Yeah, no. This http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/ gives a pretty good treatment of it. TL;DR - OS/filesystem people's loose idea of truth would make a compiler vendor blush.
In strictly-conforming C code, you can't. Standard C has a very loose concept of what a file is. Even getting the size of a file can't be done with strictly-conforming C, other than by opening the file in binary mode and reading it byte-by-byte and counting.* There's simply no way to specify how a file will be stored in conforming C code.
So you're left with using system-dependent methods.
POSIX provides the posix_fallocate() function:
SYNOPSIS
#include <fcntl.h>
int posix_fallocate(int fd, off_t offset, off_t len);
DESCRIPTION
The posix_fallocate() function shall ensure that any required
storage for regular file data starting at offset and continuing for
len bytes is allocated on the file system storage media. If
posix_fallocate() returns successfully, subsequent writes to the
specified file data shall not fail due to the lack of free space on
the file system storage media.
Note, again, though, that there's no way to control how the underlying system ensures that subsequent writes to the file will succeed.
It's still "implementation-defined" if the file space will be contiguous. It's probably more likely that the space reserved will be allocated by the file system as a contiguous chunk, but there's absolutely no guarantees.
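For illustration, here is a minimal sketch of using it (the file name and sizes are made up; note that posix_fallocate() returns an error number rather than setting errno):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Reserve room for 1,000 16 KB pages up front so later writes cannot fail
   for lack of space. Contiguity is still NOT guaranteed. */
int main(void) {
    int fd = open("my.database", O_RDWR | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return 1; }

    int err = posix_fallocate(fd, 0, (off_t)1000 * 16 * 1024);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return err != 0;
}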
Linux provides the fallocate() function:
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
int fallocate(int fd, int mode, off_t offset, off_t len);
DESCRIPTION
This is a nonportable, Linux-specific system call. For the portable,
POSIX.1-specified method of ensuring that space is allocated for a
file, see posix_fallocate(3).
fallocate() allows the caller to directly manipulate the allocated
disk space for the file referred to by fd for the byte range starting
at offset and continuing for len bytes.
...
Note how this is explicitly listed as "nonportable, Linux-specific".
And even the non-portable, Linux-specific fallocate() function provides absolutely no guarantees about contiguous file allocation.
Because actual allocation of space is a file-system dependent operation.
XFS, for example, tries to preallocate file space so data is stored in contiguous blocks.
But again, with no guarantees.
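For completeness, a minimal sketch of the Linux-specific call (mode 0 requests a plain allocation that extends the file size; again, no contiguity promise):

#define _GNU_SOURCE
#include <fcntl.h>

/* Ask Linux to allocate the first 16 MB of the file's data blocks.
   Returns 1 on success, 0 on failure. */
int preallocate_16mb(int fd) {
    return fallocate(fd, 0, 0, (off_t)16 * 1024 * 1024) == 0;
}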
Only high-performance file systems such as Oracle's HSM (Sun's SAM-QFS) and IBM's Spectrum Scale (originally called GPFS) provide the level of control necessary for you to even have any real chance at getting contiguous allocation of space in a file.
For example, the Oracle HSM/QFS sam_setfa() function:
NAME
sam_setfa - Sets attributes on a file or directory
SYNOPSIS
cc [ flag ... ] file ... -L/opt/SUNWsamfs/lib -lsam [library ... ]
#include "/opt/SUNWsamfs/include/lib.h"
int sam_setfa(const char *path, const char *ops);
DESCRIPTION
sam_setfa() sets attributes on a file or directory using a
SAM-QFS system call. path is the file on which to set the
attributes. ops is the character string of options, for
example: "ds1". Individual options are described below.
OPTIONS
A n Specifies the number of bytes to be allocated ahead of
a write to the file. The n must be an integer and must
be greater than or equal to one kilobyte and less than
4 terabytes. The n is rounded down to units of kilo-
bytes. This option is only valid for a regular file.
This option should be used when writing large files
where more sequential allocation is desired. Note,
when the file is closed the blocks are reset to the
size of the file.
...
l n Specifies the number of bytes to be preallocated to the
file. The n must be an integer. This option can only
be applied to a regular file. If an I/O event attempts
to extend a file preallocated with the L option, the
caller receives an ENXIO error. The l option allocates
using extent allocation. This means striping is not
supported and the file is allocated on 1 disk device or
1 striped group. The L and l options are mutually
exclusive. If the file has existing disk blocks, this
option is changed to the L option.
L n Specifies the number of bytes to be preallocated to the
file. The n must be an integer. This option is only
valid for a regular file. The L option allocates using
standard allocation. This means striping is supported.
This also means the file can be extended. The L and l
options are mutually exclusive.
...
Even on high-performance, complex, proprietary file systems that offer a lot of options for how files are stored on disk, it's impossible to guarantee contiguous space allocation within a file.
Which is one reason high-end databases can use raw devices for data storage - it's really the only way to guarantee contiguous data storage.
* No, fseek()/ftell() is not strictly-conforming C code. fseek( fp, 0, SEEK_END ) is explicitly undefined behavior on a binary stream, and ftell() can't be used to get the number of bytes in a text file.
Now I read somewhere that these pages don't get fragmented in the disk image or the memory image, so my question is, HOW?
The database has to preallocate the files.
How do DBMS developers tell the operating system, "hey, I just added 16 KB of data (a page) to this file, but make sure this page doesn't get fragmented"?
That would have to be done through system services.
Is it because the memory image doesn't actually show how the bytes are really stored on disk, and it is only logical?
No.
Or is it because these DBMSes somehow ask the OS to make sure these 16 KB chunks do not get fragmented?
Again, you have to do a contiguous extension, something that has a high probability of failing.
And how do you do this in C?
You cannot do it in standard C. However, any rationally designed operating system will have services that allow you to allocate contiguous files.
PIDs 0101 (monitor status since DTCs cleared) and 0141 (monitor status this drive cycle) both return the monitor status; however, as per the specification, only 0101 differentiates between spark ignition and compression ignition, hence the bit-to-monitor mapping is different.
As per the standard documents (and Wikipedia), this distinction is missing for 0141, so how am I supposed to interpret the result of 0141 on a compression ignition vehicle?
All the PID details are in ISO 15031-5. You have to either buy it (around $80) or find it some other way. The information about PIDs on Wikipedia is not complete (and sometimes ambiguous). Below are some notes on the differences between 0x01 and 0x41 (not complete, and you cannot fully parse the responses with them alone). Hopefully it helps:
0x01 is Monitor status since DTCs cleared.
The bits in this PID shall report two pieces of information for each monitor:
1) monitor status since DTCs were last cleared, saved in NVRAM or Keep Alive RAM
2) monitors supported on the vehicle.
0x41: the bits in this PID shall report two pieces of information for each monitor:
1) Monitor enable status for the current driving cycle. This bit shall indicate when a monitor is disabled in a manner such that there is no easy way for the driver to operate the vehicle to allow the monitor to run.
Typical examples are
⎯ engine-off soak not long enough (e.g., cold start temperature conditions not satisfied)
⎯ monitor maximum time limit or number of attempts/aborts exceeded
The monitor shall not indicate “disabled” for operator-controlled conditions such as rpm, load, throttle
position. The monitor shall not indicate “disabled” from key-on because minimum time limit has not been exceeded or engine warm-up conditions have not been met, since these conditions will eventually be met as the vehicle continues to be driven.
If the operator drives the vehicle to a different altitude or ambient air temperature conditions, monitor status may change from enabled to disabled. The monitor shall not change from disabled to enabled if the conditions
change back. This could result in a monitor showing “disabled” but eventually showing “complete”.
2) Monitor completion status for the current driving/monitoring cycle. Status shall be reset to “not complete” upon starting a new monitoring cycle. Note that some monitoring cycles can include various engine operating conditions; other monitoring cycles begin after the ignition key is turned off. Some status bits on a given vehicle can utilize engine-running monitoring cycles while others can utilize engine-off monitoring cycles. Resetting the bits to “not complete” upon starting the engine will accommodate most engine-running
and engine-off monitoring cycles; however, manufacturers are free to define their own monitoring cycles.
In the latest version of the standard (SAE J1979DA-201406), which I have since bought, they have clarified that bit by specifying B3 as the injection-type bit for both 0101 and 0141. Thus this question has been solved.
I have Neo4j with a quite simple schema. There is only one type of node and one type of relationship that can bind nodes. Each node has one property (indexed) and each relationship has four properties. These are the numbers:
neo4j-sh (?)$ dbinfo -g "Primitive count"
{
"NumberOfNodeIdsInUse": 19713210,
"NumberOfPropertyIdsInUse": 109295019,
"NumberOfRelationshipIdsInUse": 44903404,
"NumberOfRelationshipTypeIdsInUse": 1
}
I run this database on virtual machine with Debian, 7 cores and 26GB of RAM. This is my Neo4j configuration:
neo4j.properties:
neostore.nodestore.db.mapped_memory=3000M
neostore.relationshipstore.db.mapped_memory=4000M
neostore.propertystore.db.mapped_memory=4000M
neostore.propertystore.db.strings.mapped_memory=300M
neostore.propertystore.db.arrays.mapped_memory=300M
neo4j-wrapper.conf:
wrapper.java.additional=-XX:+UseParallelGC
#wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.initmemory=2000
wrapper.java.maxmemory=10000
I use UseParallelGC instead of UseConcMarkSweepGC because I noticed that with UseConcMarkSweepGC only one CPU core is used during a query, and when I changed to UseParallelGC all cores are utilized. I do not run any queries in parallel, only one at a time in neo4j-shell, but mostly queries concerning the whole set of nodes, for example:
match (n:User)-->(k:User)
return n.id, count(k) as degree
order by degree desc limit 100;
and it takes 726230 ms to execute it. I also tried:
match (n:User)-->()-->(k:User)
return n.id, count(DISTINCT k) as degree
order by degree desc limit 100;
but after a long time I get only "Error occurred in server thread; nested exception is:
java.lang.OutOfMemoryError: GC overhead limit exceeded". I did not try queries with restrictions that take relationship properties into account, but that is also planned.
I think that my configuration is not optimal. I noticed that Neo4j uses at most 50% of system memory during a query and the remaining memory is free. I could change this by setting a larger value in wrapper.java.maxmemory, but I have read that I have to leave some memory for the mapped_memory settings. However, I am not sure if they are taken into account, because during a query there is a lot of free memory. How should I set the configuration for such queries?
Your queries are global queries that get slower with increasing amount of data. For every user node the number of outgoing relationships is calculated, put into a collection and sorted by count. This kind of operation consumes a lot of CPU and memory. Instead of tweaking config I guess you're better off refactoring your graph model.
Depending on your use case consider storing the degree of a user in a property on the user node. Of course any operation adding/removing a relationship for a user needs to be reflected in the degree property. Additionally you might want to index the degree property.
As per the attached, we have a Balanced Data Distributor set up in a data transformation covering about 2 million rows. The script tasks are identical - each one opens a connection to oracle and executes first a delete and then an insert. (This isn't relevant but it's done that way due to parameter issues with the Ole DB command and the Microsoft Ole DB provider for Oracle...)
The issue I'm running into is no matter how large I make my buffers or how many concurrent executions I configure, the BDD will not execute more than five concurrent processes at a time.
I've pulled back hundreds of thousands of rows in a larger buffer, and it just gets divided 5 ways. I've tried this on multiple machines - the current shot is from a 16 core server with -1 concurrent executions configured on the package - and no matter what, it's always 5 parallel jobs.
5 is better than 1, but with 2.5 million rows to insert/update, 15 rows per second at 5 concurrent executions isn't much better than 2-3 rows per second with 1 concurrent execution.
Can I force the BDD to use more paths, and if so how?
Short answer:
Yes, BDD can make use of more than five paths. You shouldn't have to do anything special to force it; by definition it should do this for you automatically. Then why isn't it using more than 5 paths? Because your source is producing data faster than your destinations can consume it, causing backpressure. To resolve it, you have to tune your destination components.
Long answer:
In theory, "the BDD takes input data and routes it in equal proportions to its outputs, however many there are." In your setup there are 10 outputs, so input data should be distributed equally to all 10 outputs at the same time, and you should see 10 paths executing at the same time - again, in theory.
But another aspect of BDD is that "instead of routing individual rows, the BDD operates on buffers of data." This means the data flow engine initiates a buffer, fills it with as many rows as possible, and moves that buffer to the next component (the script destination in your case). As you can see, 5 buffers are used, each with the same number of rows. If additional buffers had been started, you'd have seen more paths being used. SSIS couldn't use additional buffers, and ultimately additional paths, because of a mechanism called backpressure; it kicks in when the source produces data faster than the destination can consume it. If it didn't, all memory would be used up by the source data and SSIS would not have any memory left for the transformation and destination components. So to avoid that, SSIS limits the number of active buffers. It is set to 5 (and can't be changed), which is exactly the number of paths you're seeing.
PS: The text within quotes is from this article
There is a property in SSIS data flow tasks called EngineThreads which determines how many flows can be run concurrently, and its default value is 5 (in SSIS 2012 its default value is 10, so I'm assuming you're using SSIS 2008 or earlier.) The optimal value is dependent on your environment, so some testing will probably be required to figure out what to put there.
Here's a Jamie Thomson article with a bit more detail.
Another interesting thing I've discovered via this article on CodeProject.
[T]his component uses an internal buffer of 9,947 rows (as per the
experiment, I found so) and it is pre-set. There is no way to override
this. As a proof, instead of 10 lac rows, we will use only 9,947 (Nine
thousand nine forty seven ) rows in our input file and will observe
the behavior. After running the package, we will find that all the
rows are being transferred to the first output component and the other
components received nothing.
Now let us increase the number of rows in our input file from 9,947 to
9,948 (Nine thousand nine forty eight). After running the package, we
find that the first output component received 9,947 rows while the
second output component received 1 row.
So I notice in your first buffer run that you pulled 50,000 records. Those got divided into 9,984 record buckets and passed to each output. So essentially the BDD takes the records it gets from the buffer and passes them out in ~10,000 record increments to each output. So in this case perhaps your source is the bottleneck.
Perhaps you'll need to split your original Source query in half and create two BDD-driven data flows to in essence double your parallel throughput.
"The output column "A" (67) on output "Output0" (5) and component "Data Flow Task" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance."
Please resolve my problem
These warnings indicate that you have columns in your data flow that are not used. A Data Flow works by allocating "buckets" of fixed-size memory, filling them with data from the source, and allowing the downstream components to directly access the memory addresses to perform synchronous transformations.
Memory is a finite resource. If SSIS detects it has 1 MB to work with and one row of data costs 4,096 bytes, then you could have at most 256 rows of data in the pipeline before running out of memory space. Those 256 rows would get split into N buckets of rows, because as much as possible you want to perform set-based operations when working with databases.
Why does all this matter? SSIS detects whether you've used everything you've brought into the pipeline. If a column is never used, then you're wasting memory. Instead of a single row costing 4,096 bytes, by excluding unused columns you reduce the memory required for each row to 1,024 bytes, and now you can fit 1,024 rows in the pipeline just by taking only what you need.
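As a toy calculation with those made-up numbers (the buffer size and row widths here are illustrative, not SSIS defaults):

#include <stdio.h>

/* Illustrative buffer math only; real SSIS buffer sizing is configurable. */
int main(void) {
    int buffer_bytes = 1024 * 1024;                              /* 1 MB of pipeline memory */
    printf("%d rows at 4096 bytes/row\n", buffer_bytes / 4096);  /* 256 rows  */
    printf("%d rows at 1024 bytes/row\n", buffer_bytes / 1024);  /* 1024 rows */
    return 0;
}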
How do you get there? In your data source, write a query instead of selecting a table. Don't use SELECT * FROM myTable; instead, explicitly enumerate all of the columns you need and nothing more. The same goes for Flat File Sources: uncheck the columns that are never used. You'll still pay a disk penalty for having to read the whole row in, but those columns don't have to hit your data flow and consume that memory. Same story for any Lookups - only query the data you need.
Asynchronous components are the last thing to be aware of as this has turned into a diatribe on performance. The above calculations are much like freshman calculus classes: assume a cow is a sphere to make the math easier. Asynchronous components result in your memory being split before and after the component. They radically change the shape of the rows going through a component such that downstream components can't reuse the address space above it. This results in a physical memory copy which is a slow operation.
My final comment, though, is that if your package is performing adequately and finishing in an acceptable time frame, then unless you have nothing else to do, leave it be and move on to your next task. These are just warnings and will not "grow up" into full-blown errors.