Large log data in Couchbase

I have a Couchbase server storing a huge amount of data.
The data grows daily, but I also delete it daily after processing it.
Currently there are about 1,320,168 items, with 2.97 GB of data usage.
But why is disk usage so large, at 135 GB?
My disk is running low on space to store more data.
Can I delete these log data files to reduce disk usage?

Couchbase uses an append-only format for storage. This means that every update or delete operation is actually stored as a new entry in the storage file and consumes more disk space.
A process called compaction then reclaims the unnecessarily used space (a small sketch of this effect appears below). Compaction can either be configured to run automatically, when a certain fragmentation percentage is reached in your cluster, or run manually on each node.
IIRC auto-compaction is not on by default.
So what you probably want to do is run compaction on your cluster. Note that compaction itself may temporarily require quite a large amount of disk space, as noted here...
See the docs on how to perform compaction (in your case, I guess you have an "off-peak" window at the end of the business day where you currently delete and could also perform compaction).
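To picture why the disk keeps growing, here is a small, purely illustrative Python sketch (a toy model, not the actual Couchbase storage engine) of an append-only file that only shrinks when compaction rewrites it with the live items:

    # Illustrative model of append-only storage: every mutation appends a new
    # record, so the file grows even when the live data set stays small.
    # This is a toy simulation, not the actual Couchbase storage engine.

    append_log = []   # (key, value-or-None) entries; None marks a delete
    live = {}         # what the data set actually looks like right now

    def write(key, value):
        append_log.append((key, value))
        live[key] = value

    def delete(key):
        append_log.append((key, None))   # deletes also consume space
        live.pop(key, None)

    def compact():
        """Rewrite the log, keeping only the latest live version of each key."""
        global append_log
        append_log = [(k, v) for k, v in live.items()]

    # Simulate five days: load 1000 items each day, then delete them after processing.
    for day in range(5):
        for i in range(1000):
            write(f"doc::{i}", "x" * 100)
        for i in range(1000):
            delete(f"doc::{i}")

    print("live items:", len(live))                  # 0
    print("log entries on disk:", len(append_log))   # 10000 -- keeps growing
    compact()
    print("log entries after compaction:", len(append_log))   # 0

The 135 GB vs. 2.97 GB gap in your bucket is this effect at scale: stale versions and delete markers stay on disk until compaction rewrites the files.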
PS: The folks in the official forums may have more insight and recommendations to offer.

Related

Should unused tables be archived?

There is a table in our database that takes up about 25 GB. It is no longer used by the current code.
Will it give any performance improvement (for the rest of the tables) if we archive this table, even though it's not queried/used? Please provide an explanation.
We are using MySQL with AWS Aurora.
Archiving tables will not have any impact on Aurora. Unused pages are eventually evicted from the buffer pool [1], and after that they never get pulled back onto the DB instances unless you run a query that touches those pages.
You would continue to pay storage costs (and other indirect costs like snapshots) by keeping them unused. A better option would be to move the unused data to a new cluster, create a snapshot of it, and remove the cluster. You can always recover the data when you need it by restoring the snapshot. The original database can then be cleaned up by dropping the unused tables. This way you end up paying only for the snapshot, which is cheaper.
You could also export the data out of MySQL (as CSV, say) and store it in S3/Glacier; a sketch of such an export is shown below. The only caveat is that when you need to access the data again, loading it back into an existing or new database cluster can be a much more time-consuming effort.
[1] The buffer pool uses LRU for eviction. When your workload runs for long enough, you will eventually evict all the pages associated with the unused table. Link: https://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html
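As a rough illustration of the CSV-export route mentioned above, here is a hedged Python sketch using mysql-connector-python; the host, credentials, table name, and output path are placeholders for your environment, not values from the question:

    # Sketch: dump an unused table to CSV before dropping it.
    # Connection settings and table name are placeholders -- adjust for your
    # Aurora endpoint and schema.
    import csv
    import mysql.connector

    conn = mysql.connector.connect(
        host="my-aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com",
        user="admin",
        password="secret",
        database="mydb",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM unused_table")

    with open("unused_table.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(cursor.column_names)   # header row
        while True:
            rows = cursor.fetchmany(10_000)    # stream in batches, not all at once
            if not rows:
                break
            writer.writerows(rows)

    cursor.close()
    conn.close()
    # The resulting file can then be uploaded to S3/Glacier and the table dropped.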
Yes, archiving will also improve performance, along with reducing database size and speeding up backup/recovery cycles.
I have tried it on different projects in my recent full-time job and the results are amazing. To those who doubt it I would only say:
A smaller footprint reduces disk I/O and scans.
A smaller footprint reduces buffer requirements, and hence RAM requirements.
YES, archiving infrequently used data will ease the burden on faster, more frequently accessed data storage systems. Older data that is unlikely to be needed often is put on systems that don't need the speed and accessibility of systems holding data still in use.
Archived data is stored on a lower-cost tier of storage, serving as a way to reduce primary storage consumption and related costs. Typically, data deduplication is performed on data being moved to a lower storage tier, which reduces the overall storage footprint and lowers secondary storage costs.

Very frequent couchbase document updates

I'm new to Couchbase and was wondering whether very frequent updates to a single document (possibly every second) will cause all updates to pass through the disk write queue, or only the last update made to the document.
In other words, does Couchbase optimize disk writes by writing the document to disk only once, even if it was updated multiple times between writes?
Based on the docs, http://docs.couchbase.com/admin/admin/Monitoring/monitor-diskqueue.html, it sounds like all updates are processed. If anyone can confirm this, I'd be grateful.
thanks
Updates are held in a disk queue before being written to disk. If a write to a document occurs and a previous write is still in the disk queue, then the two writes will be coalesced, and only the more recent version will actually be written to disk.
Exactly how fast the disk queue drains will depend on the storage subsystem, so whether writes to the same key get coalesced will depend on how quick the writes come in compared to the storage subsystem speed / node load.
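The coalescing behaviour described above can be pictured with a small, purely illustrative Python sketch (a toy model, not the actual Couchbase internals): the pending queue is keyed by document ID, so a second mutation that arrives before the flush simply replaces the one still waiting.

    # Toy model of a de-duplicating disk write queue: if a document is updated
    # again while a previous version is still waiting to be flushed, only the
    # newest version is written out.

    disk = {}
    pending = {}   # doc id -> latest value waiting to be persisted

    def set_doc(key, value):
        pending[key] = value      # overwrites any not-yet-flushed version

    def flush():
        writes = len(pending)
        disk.update(pending)
        pending.clear()
        return writes

    # 100 updates to the same document arrive between two flushes...
    for i in range(100):
        set_doc("counter::1", {"n": i})

    print("documents written this flush:", flush())    # 1, not 100
    print("on disk:", disk["counter::1"])               # {'n': 99}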
Jako, you should worry more about updates happening in the millisecond time frame, or more than one update happening within a single millisecond. The disk write isn't the problem; Couchbase handles that intelligently itself. The real issue is that you will run into concurrency issues when you operate in the millisecond time frame.
I ran into them fairly easily when testing my application, and at first couldn't understand why Node.js (in my case) would sometimes write data to Couchbase and sometimes not, usually failing for the first record.
More problems arose when I first checked whether a document with a specific key existed and, finding that it didn't, tried to write it to Couchbase, only to discover that in the meantime an earlier callback had finished and there was now indeed a document under the same key.
In those cases you have to use the CAS value and program the update iteratively, so that your app keeps pulling the current version of the document for that key and retries the update. Keep this in mind especially when running tests that update the same document!
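The CAS retry pattern looks roughly like the sketch below. To keep it self-contained it uses a tiny in-memory stand-in for the bucket; with a real Couchbase SDK you would use its get and replace-with-CAS calls instead, but the read-modify-retry loop is the same.

    # Sketch of an optimistic-concurrency (CAS) update loop.
    # FakeBucket is a stand-in so the example runs on its own; the point is
    # the read-modify-write-retry structure, not the client API.

    class CasMismatch(Exception):
        pass

    class FakeBucket:
        def __init__(self):
            self._data = {}    # key -> (cas, document)

        def insert(self, key, doc):
            self._data[key] = (1, doc)

        def get(self, key):
            return self._data[key]              # returns (cas, doc)

        def replace(self, key, doc, cas):
            current_cas, _ = self._data[key]
            if cas != current_cas:              # someone else updated it first
                raise CasMismatch(key)
            self._data[key] = (current_cas + 1, doc)

    def increment_counter(bucket, key, retries=10):
        """Keep re-reading and retrying until the CAS matches."""
        for _ in range(retries):
            cas, doc = bucket.get(key)
            doc = dict(doc, count=doc["count"] + 1)
            try:
                bucket.replace(key, doc, cas)
                return doc
            except CasMismatch:
                continue                        # document changed underneath us, retry
        raise RuntimeError("gave up after %d CAS retries" % retries)

    bucket = FakeBucket()
    bucket.insert("stats::page1", {"count": 0})
    print(increment_counter(bucket, "stats::page1"))   # {'count': 1}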

Does couchbase actually support datasets larger than memory?

Couchbase documentation says that "Disk persistence enables you to perform backup and restore operations, and enables you to grow your datasets larger than the built-in caching layer," but I can't seem to get it to work.
I am testing Couchbase 2.5.1 on a three node cluster, with a total of 56.4GB memory configured for the bucket. After ~124,000,000 100-byte objects -- about 12GB of raw data -- it stops accepting additional puts. 1 replica is configured.
Is there a magic "go ahead and spill to disk" switch that I'm missing? There are no suspicious entries in the errors log.
It does support data greater than memory - see Ejection and working set management in the manual.
In your instance, what errors are you getting from your application? When you start to reach the low memory watermark, items need to be ejected from memory to make room for newer items.
Depending on the disk speed and the rate of incoming items, this can result in TEMP_OOM errors being sent back to the client, telling it to temporarily back off before performing the set, but these should generally be rare in most instances. Details on handling these can be found in the Developer Guide.
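Handling those temporary out-of-memory responses usually amounts to a retry with backoff. The sketch below uses a stand-in store_item function and a stand-in TempOOMError class, since the exact call and exception name depend on the SDK you use:

    # Sketch: back off and retry when the server says it is temporarily out of
    # memory (ejection has not yet freed enough room). TempOOMError and
    # store_item are illustrative stand-ins for whatever your SDK exposes.
    import random
    import time

    class TempOOMError(Exception):
        pass

    def store_item(key, value):
        # Placeholder for the real SDK call; fails ~30% of the time here
        # just so the retry path is exercised.
        if random.random() < 0.3:
            raise TempOOMError(key)

    def store_with_backoff(key, value, max_attempts=8):
        delay = 0.05
        for attempt in range(1, max_attempts + 1):
            try:
                store_item(key, value)
                return attempt
            except TempOOMError:
                time.sleep(delay)              # give ejection time to free memory
                delay = min(delay * 2, 2.0)    # exponential backoff, capped
        raise RuntimeError("server still out of memory after %d attempts" % max_attempts)

    print("stored after", store_with_backoff("doc::1", b"x" * 100), "attempt(s)")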
My guess would be that it's not the raw data that is filling up your memory, but the metadata associated with it. Couchbase 2.5 needs 56 bytes per key, so in your case that would be approximately 7 GB of metadata, which is much less than your memory quota.
But... metadata can become fragmented in memory. If you batch-inserted all 124M objects in a very short time, I would assume you got at least 90% fragmentation. That means that with only 7 GB of useful metadata, the space required to hold it has filled up your RAM, with lots of unused gaps in each allocated block.
The solution to your problem is to defragment the metadata, which can either be done manually or triggered automatically when needed.
If you need more insights about why compaction is needed, you can read this blog article from Couchbase.
Even if none of your documents are stored in RAM, Couchbase still keeps all the document IDs and metadata in memory (this will change in version 3), and also needs some available memory to run efficiently. The relevant section in the docs:
http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#memory-quota
Note that when you use a replica you need twice as much RAM. The formula is roughly:
(56 + avg_size_of_your_doc_ID) * nb_docs * 2 (replica) * (1 + headroom) / (high_water_mark)
So depending on your configuration, it's quite possible that 124,000,000 documents require 56 GB of memory.
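Plugging the question's numbers into that formula shows how quickly the metadata alone eats into the 56.4 GB quota. The key length, headroom, and high-water-mark values below are assumptions, so treat the result as an order-of-magnitude check rather than a measurement:

    # Rough evaluation of the metadata RAM formula for the numbers in the question.
    # Key length, headroom and high water mark are assumed values, not measured ones.
    metadata_per_doc = 56          # bytes of metadata per key in Couchbase 2.5
    avg_key_len = 40               # assumed average document ID length in bytes
    nb_docs = 124_000_000
    replicas = 2                   # 1 active copy + 1 replica
    headroom = 0.30                # assumed overhead for fragmentation etc.
    high_water_mark = 0.85         # assumed fraction of quota usable before ejection

    ram_bytes = ((metadata_per_doc + avg_key_len) * nb_docs * replicas
                 * (1 + headroom) / high_water_mark)
    print("approx. RAM needed for metadata: %.1f GB" % (ram_bytes / 1024**3))
    # ~34 GB with these assumptions -- already more than half of the 56.4 GB
    # quota before any document values are cached at all.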

Can massive writing & deleting files hurt our server performance?

We run a system that, for caching purposes, currently writes and deletes about 1,000 small files (10 KB each) every hour.
In the near future this number will rise to about 10,000 - 20,000 files being written and deleted every hour.
For every file that is written, a new row is added to our MySQL DB, and that row is deleted when the file is deleted an hour later.
My questions:
Can these heavy write & delete operations eventually hurt our server performance somehow?
(BTW, we currently run this on a VPS and will soon move to a dedicated server.)
Can writing and deleting so many rows eventually slow down our DB?
This depends greatly on the operating system, the file system, and the file system caching configuration. It also depends on whether your database is stored on the same disk as the files being written and deleted.
Usually, operations that affect file system structure, such as file creation and deletion, require some synchronous disk I/O, so the operating system will not lose these changes after a power failure. However, some operating systems and file systems support a more relaxed policy here; for example, the UFS file system on FreeBSD has a nice "soft updates" option that does this. ext3/Linux probably has a similar feature.
Once you move to a dedicated server, I think it would be reasonable to attach several HDDs and make sure that the database is stored on one disk while the massive file operations are performed on another. In that case DB performance should not be affected.
You should do some calculations and estimate the required throughput for the storage. In your worst-case scenario, 20,000 files x 10 KB = 200 MB per hour, which is a very low requirement.
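For reference, here is that arithmetic spelled out; the file count and size are the question's worst-case figures:

    # Worst-case hourly write volume from the question: 20,000 files of 10 KB each.
    files_per_hour = 20_000
    file_size_kb = 10

    mb_per_hour = files_per_hour * file_size_kb / 1024
    print("%.0f MB written per hour" % mb_per_hour)        # ~195 MB/hour
    print("%.3f MB/s sustained" % (mb_per_hour / 3600))    # ~0.054 MB/s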
Deleting a file, on modern filesystems, takes a very little time.
In my opinion you don't have to worry, especially if your application creates and deletes files sequentially.
Consider also that modern operating systems cache parts of the file system in memory to improve performance and reduce disk access (this is especially true for multiple deletes).
Your database will grow, but engines are optimized for that; no need to worry about it.
The only downside is that handling many small files can cause disk fragmentation if your file system is susceptible to it.
For a performance bonus, you should consider using separate physical storage for these files (e.g. a different disk drive or disk array) so you get the full transfer bandwidth without interference from other I/O.

MySQL concatenating large string

I have a web crawler that saves information to a database as it crawls the web. While it does this, it also saves a log of its actions and any errors it encounters to a log field in a MySQL database (the field grows to anywhere from 64 KB to 100 KB). It accomplishes this by concatenating, using the MySQL CONCAT function.
This seems to work fine, but I am concerned about the CPU usage and the impact it has on the MySQL database. I've noticed that the web crawling is performing slower than before I implemented saving the log to the database.
I view this log from a management web page, and the current implementation seems to work fine other than the slow loading. Any recommendations for speeding this up, or alternative implementations?
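For concreteness, the pattern described above looks roughly like the following sketch; the table name, column names, and connection details are assumptions, not from the question. Every call makes the server rewrite an increasingly large TEXT value:

    # Sketch of the described approach: appending crawler output to a TEXT column
    # with CONCAT. Table/column names and connection details are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="crawler",
                                   password="secret", database="crawlerdb")
    cursor = conn.cursor()

    def append_log(crawl_id, fragment):
        # Each call makes MySQL read the existing (up to ~100 KB) value,
        # concatenate the new fragment, and write the whole thing back.
        cursor.execute(
            "UPDATE crawl_log SET log_text = CONCAT(IFNULL(log_text, ''), %s) "
            "WHERE crawl_id = %s",
            (fragment, crawl_id),
        )
        conn.commit()

    append_log(42, "fetched http://example.com (200 OK)\n")

Batching several log lines into one append, instead of one per crawler action, at least reduces how often that rewrite happens.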
Reading 100 KB strings into memory numerous times and then writing them to disk via a DB? Of course you're going to experience slowdown! Every part of what you are doing taxes memory, disk, and CPU (especially if memory usage hits the system max and you start swapping to disk). Let me count some of the ways you may be decreasing overall site performance:
SQL connections max out and back up, as the time needed to store 100 KB records increases how long a single process holds a connection.
Web server processes eat up the free process pool and max out, and they take longer to free up because they have to wait on DB connections.
Web server processes begin to bloat and each takes more memory, possibly more than the system can handle without swapping. This is compounded by hitting the max number of processes due to #2.
... A book could be written about your situation.