Couchbase document expiration performance

I have a 6-node Couchbase cluster with about 200 million documents in one bucket. I need to delete about 100 million of those documents. I'm planning to create a view that gives me an index of the documents I need to delete, and then do a touch operation to set the expiry of those documents to the next day.
I understand that Couchbase runs a background expiry pager at regular intervals to delete expired documents. Will this expiry pager, working through 100 million documents, have an impact on cluster performance?

If you set them all to expire at around the same time, maybe. Whether it affects performance depends on your cluster's sizing. If it were me, unless you have some compelling reason to get rid of them all right this moment, I would play it safe and just set a random TTL for each document somewhere between now and a few days from now. The server will then take care of it for you and you don't have to worry about it.
Document expiration in Couchbase is given in seconds or as UNIX epoch time; if it is more than 30 days out, it has to be UNIX epoch time.
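As a rough illustration of that advice, here is a minimal sketch assuming the Couchbase Python SDK 2.x (where touch() takes a TTL in seconds); the bucket address and the list of document IDs coming from your view are placeholders:

import random
from couchbase.bucket import Bucket

bucket = Bucket('couchbase://node1.example.com/mybucket')

# placeholder: document IDs returned by the view that indexes the docs to delete
ids_from_view = ['doc::1', 'doc::2', 'doc::3']

THREE_DAYS = 3 * 24 * 3600

for doc_id in ids_from_view:
    # spread the expirations randomly over the next few days so the
    # expiry pager never has to reap 100 million documents in one pass
    bucket.touch(doc_id, ttl=random.randint(3600, THREE_DAYS))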

Related

Spark memory requirements for querying 20 GB

Before diving into the actual coding, I am trying to understand the logistics around Spark.
I have server logs split into 10 CSV files of around 2 GB each.
I am looking for a way to extract some data, e.g. how many failures occurred in a period of 30 minutes per server.
(The logs have entries from multiple servers, i.e. there is no predefined ordering by time or by server.)
Is that something I could do with Spark?
If yes, would that mean I need a box with 20+ GB of RAM?
When I operate on RDDs in Spark, does each operation take the full dataset into account? E.g. would an operation that orders by timestamp and server ID execute over the full 20 GB dataset?
Thanks!
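For what it's worth, this is the kind of job Spark handles without holding all 20 GB in RAM at once, since it processes the files partition by partition and spills to disk when needed. A minimal sketch, assuming PySpark and hypothetical column names (timestamp, server_id, status):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-failures").getOrCreate()

# read all ten CSVs; Spark splits them into partitions rather than loading 20 GB at once
logs = spark.read.csv("logs/*.csv", header=True, inferSchema=True)

failures_per_window = (
    logs.filter(F.col("status") == "FAILURE")                          # keep only failure rows
        .groupBy("server_id",
                 F.window(F.to_timestamp("timestamp"), "30 minutes"))  # 30-minute buckets per server
        .count()
)

failures_per_window.show()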

Caching Response in MEMORY vs. MyISAM vs. InnoDB

I need to cache API responses due to a high flood delay. So I will cache every response for up to 30 seconds, and whenever I get a cache hit I will use the cached data if it's younger than 30 seconds.
Which storage engine is the best for that?
My first thought was MEMORY, but it locks the whole table, which I think could cause trouble when many users are online.
Then I thought about InnoDB, but I read that MyISAM is better with many write operations.
But MyISAM again locks the whole table.
Which storage engine is best as an API response cache?
None of the above.
MySQL is a database, not a cache. The idea is to store data persistently (and in the case of InnoDB, durably). There is no TTL (time to live) or automatic expiration of data in SQL. You would also have to remove data that is older than 30 seconds yourself, or else it would accumulate.
You should use a cache server for data you want to disappear after 30 seconds. Memcached for example allows you to set a TTL in seconds when you set an object in the cache.
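As a rough illustration, assuming the pymemcache client and a memcached instance on localhost; the key naming and fetch_from_api() are placeholders, not part of any real API:

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def get_response(endpoint):
    key = "api:" + endpoint
    cached = cache.get(key)
    if cached is not None:
        return cached                       # cache hit: data is less than 30 seconds old
    response = fetch_from_api(endpoint)     # placeholder for the real API call
    cache.set(key, response, expire=30)     # memcached expires the entry after 30 seconds
    return response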
InnoDB.
But...
You say "API responses" -- is that HTML pages? AJAX replies? SQL queries? Something else?
How often is the data changing? Your design will deliver data that is up to 30 seconds out of date; is that acceptable? The "Query cache" avoids this, but has other issues.
Have you considered memcached?
What is the overhead of inserting into the proposed cache and checking it every time? Probably more than the alternatives.
Quite probably you have one or two queries that are a lot slower than they have to be. Fixing them would eliminate the need for extra caching. Please SHOW CREATE TABLE and EXPLAIN SELECT .... We may be able to quickly spot ways to speed them up.

API Logging and MySQL Load

I made an API and am logging all requests to it.
When someone hits the API, there is an insert and then an update (to record the API response code), as sketched below.
During testing the API log is around 200k records; this could grow to a few million records very quickly.
Does this kind of logging, i.e. an insert and an update per request, put a lot of pressure on the server?
My concern is that MySQL will get overloaded due to logging, so I am not sure if I should trim the logs every 7 days or something.
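For reference, a minimal sketch of that insert-then-update pattern, assuming mysql-connector-python and a hypothetical api_log table with id, endpoint, requested_at, and response_code columns:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="apidb")
cur = conn.cursor()

# 1) log the incoming request
cur.execute(
    "INSERT INTO api_log (endpoint, requested_at) VALUES (%s, NOW())",
    ("/v1/users",),
)
log_id = cur.lastrowid

# ... handle the request ...

# 2) record the response code once the request has finished
cur.execute(
    "UPDATE api_log SET response_code = %s WHERE id = %s",
    (200, log_id),
)
conn.commit()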
A few million records won't hurt the database storage-wise. I've got MySQL tables with hundreds of millions of rows and they work fine when indexed. I would consult the MySQL manual for Data Manipulation Statements and Optimization to see if what you're doing is going to stretch performance limits.
See MySQL's indexing guide if you'd like to know more.
If you're worried about overloading your DB, you could write a cron job to back up your database each day and then just truncate the log table after hitting a certain row count or time of day or whatever (a sketch of one way to do this follows). You would then have a backup of all the records that you need.
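A minimal sketch of that idea, assuming mysql-connector-python and hypothetical api_log / api_log_archive tables with a requested_at timestamp column; run it from a daily cron job:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="apidb")
cur = conn.cursor()

# copy rows older than 7 days into the archive (backup) table...
cur.execute(
    "INSERT INTO api_log_archive "
    "SELECT * FROM api_log WHERE requested_at < NOW() - INTERVAL 7 DAY"
)
# ...then remove them from the hot logging table
cur.execute("DELETE FROM api_log WHERE requested_at < NOW() - INTERVAL 7 DAY")
conn.commit()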
Hope this helps.

IOPS or Throughput? - Determining Write Bottleneck in Amazon RDS Instance

We have nightly load jobs that write several hundred thousand records to a MySQL reporting database running in Amazon RDS.
The load jobs are taking several hours to complete, but I am having a hard time figuring out where the bottleneck is.
The instance is currently running with General Purpose (SSD) storage. Looking at the CloudWatch metrics, it appears I am averaging less than 50 IOPS for the last week. However, Network Receive Throughput is less than 0.2 MB/sec.
Is there any way to tell from this data whether I am being bottlenecked by network latency (we are currently loading the data from a remote server...this will change eventually) or by write IOPS?
If IOPS is the bottleneck, I can easily upgrade to Provisioned IOPS. But if network latency is the issue, I will need to redesign our load jobs to load raw data from EC2 instances instead of our remote servers, which will take some time to implement.
Any advice is appreciated.
UPDATE:
More info about my instance: I am using an m3.xlarge instance provisioned with 500 GB of storage. The load jobs are done with Pentaho's ETL tool. They pull from multiple (remote) source databases and insert into the RDS instance using multiple threads.
You aren't using up much CPU. Your memory is very low. An instance with more memory should be a good win.
You're only doing 50-150 IOPS. That's low; you should get 3000 in a burst on standard SSD-level storage. However, if your database is small, that is probably hurting you, since you get 3 IOPS per GB: if you are on a 50 GB or smaller database, consider paying for provisioned IOPS.
You might also try Aurora; it speaks mysql, and supposedly has great performance.
If you can spread out your writes, the spikes will be smaller.
A very quick test is to buy provisioned IOPS, but be careful as you may get fewer than you do currently during a burst.
Another quick means to determine your bottleneck is to profile your job execution application with a profiler that understands your database driver. If you're using Java, JProfiler will show the characteristics of your job and its use of the database.
A third is to configure your database driver to print statistics about the database workload. This might inform you that you are issuing far more queries than you would expect.
When accessing the database remotely, your most likely culprit is actually round-trip latency. The impact is easy to overlook or underestimate.
If the remote database has, for example, a 75 millisecond round-trip time, you can't possibly execute more than 1000 (milliseconds/sec) / 75 (milliseconds/round trip) = 13.3 queries per second if you're using a single connection. There's no getting around the laws of physics.
The spikes suggest inefficiency in the loading process, where it gathers for a while, then loads for a while, then gathers for a while, then loads for a while.
Separate but related: if you don't have the MySQL client/server compression protocol enabled on the client side, find out how to enable it. (The server always supports compression, but the client has to request it during the initial connection handshake.) This won't fix the core problem, but it should improve the situation somewhat, since less data to physically transfer means less time wasted in transit.
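A minimal sketch of both ideas (fewer round trips via batched multi-row inserts, plus the compression protocol), assuming mysql-connector-python, which exposes a compress connection flag and batches INSERTs in executemany(); the endpoint, credentials, and report_rows table are placeholders:

import mysql.connector

conn = mysql.connector.connect(
    host="my-rds-endpoint.rds.amazonaws.com",
    user="etl", password="secret", database="reporting",
    compress=True,          # request the client/server compression protocol
)
cur = conn.cursor()

rows = [(1, "a"), (2, "b"), (3, "c")]   # placeholder batch produced by the ETL job

# executemany() folds these into a single multi-row INSERT, so a 75 ms
# round trip is paid once per batch instead of once per row
cur.executemany("INSERT INTO report_rows (id, val) VALUES (%s, %s)", rows)
conn.commit()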
I'm not an RDS expert and I don't know if my particular case can shed any light for you. Anyway, I hope it gives you some kind of insight.
I have a db.t1.micro with 200 GB provisioned (that gives me 600 IOPS of baseline performance) on General Purpose SSD storage.
The heaviest workload is when I aggregate thousands of records from a pool of around 2.5 million rows drawn from a 10-million-row table and another table of 8 million rows. I do this every day. This is what I average (it is steady performance, unlike yours, where I see a pattern of spikes):
Write/ReadIOPS: +600 IOPS
NetworkTrafficReceived/Transmit throughput: < 3,000 Bytes/sec (my queries are relatively short)
Database connections: 15 (workers aggregating on parallel)
Queue depth: 7.5 counts
Read/Write Throughput: 10MB per second
The whole aggregation task takes around 3 hours.
Also check the "10 tips to improve the performance of your app in AWS" SlideShare deck from AWS Summit 2014.
I don't know what else to say since I'm not an expert! Good luck!
In my case it was the number of records. I was writing only 30 records per minute and had Write IOPS of roughly the same, 20 to 30. But this was eating at the CPU, which reduced the CPU credits quite steeply. So I took all the data in that table, moved it to another "historic" table, and cleared all the data from the original table.
CPU dropped back down to normal levels, but Write IOPS stayed about the same, which was fine. The problem: indexes. I think that because so many records needed to be indexed when inserting, it took far more CPU to do that indexing with that number of rows, even though the only index I had was the primary key.
Moral of my story: the problem is not always where you think it lies. Although I had increased Write IOPS, that was not the root cause of the problem; rather, it was the CPU being used for index maintenance on insert, which caused the CPU credits to fall.
Not even X-Ray on the Lambda could catch an increased query time. That is when I started to look at the DB directly.
Your queue depth graph shows > 2, which clearly indicates that the IOPS are under-provisioned (if queue depth is < 2, then IOPS are over-provisioned).
I think you have used the default AUTOCOMMIT = 1 (autocommit mode). It performs a log flush to disk for every insert, which exhausts the IOPS.
So for performance it is better to use AUTOCOMMIT = 0 around bulk inserts of data in MySQL, so that the insert sequence looks like this:
set AUTOCOMMIT = 0;
START TRANSACTION;
-- first 10000 recs
INSERT INTO SomeTable (column1, column2) VALUES (vala1,valb1),(vala2,valb2) ... (vala10000,valb10000);
COMMIT;
-- next 10000 recs
START TRANSACTION;
INSERT INTO SomeTable (column1, column2) VALUES (vala10001,valb10001),(vala10002,valb10002) ... (vala20000,valb20000);
COMMIT;
-- next 10000 recs
.
.
.
set AUTOCOMMIT = 1;
Using the above approach on a t2.micro, I inserted 300,000 records in 15 minutes using PHP.

Set eventual consistency (late commit) in MySQL

Consider the following situation: you want to update the number of page views of each profile in your system. This action is very frequent, as almost every visit to your website results in a page view increment.
The basic way is UPDATE Users SET page_views = page_views + 1. But this is far from optimal, because we don't really need an instant update (an hour late is OK). Is there any other way in MySQL to postpone a sequence of updates and apply them cumulatively at a later time?
I tried another method myself: storing a counter (number of increments) for each profile in a file. But this results in handling a few thousand small files, and I think the disk I/O cost (even with a deep directory tree for the files) would probably exceed that of the database.
What is your suggestion for this problem (other than MySQL)?
To improve performance you could store your page view data in a MEMORY table. This is super fast but temporary: the table only persists while the server is running, and on restart it will be empty.
You could then create an EVENT that periodically folds the data into a table that persists it. This would improve performance a little, with the risk that, should the server go down, only the visits counted since the last run of the event would be lost.
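A minimal sketch of that combination, assuming mysql-connector-python, the Users(id, page_views) table from the question, a hypothetical page_view_buffer table, and that the event scheduler is enabled (SET GLOBAL event_scheduler = ON):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")
cur = conn.cursor()

# in-memory buffer that absorbs the frequent +1 writes
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_view_buffer (
        profile_id INT PRIMARY KEY,
        views INT NOT NULL DEFAULT 0
    ) ENGINE=MEMORY
""")

# every 15 minutes, fold the buffered counts into the persistent Users table
cur.execute("""
    CREATE EVENT IF NOT EXISTS flush_page_views
    ON SCHEDULE EVERY 15 MINUTE
    DO BEGIN
        UPDATE Users u
        JOIN page_view_buffer b ON b.profile_id = u.id
        SET u.page_views = u.page_views + b.views;
        DELETE FROM page_view_buffer;
    END
""")
conn.commit()

# each page view then does:
#   INSERT INTO page_view_buffer (profile_id, views) VALUES (1234, 1)
#   ON DUPLICATE KEY UPDATE views = views + 1;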
The link James posted in the comment on your question, which leads to an accepted answer with another comment about memcached, was my first thought also. Store the profile IDs in memcached, then set up a cron job to run every 15 minutes, grab all the entries, and issue the updates to MySQL in a batch. But there are a few things to consider:
1. When you run the batch script to grab the IDs out of memcached, you will have to ensure you remove all entries that have been parsed, otherwise you run the risk of counting the same profile views multiple times.
2. Since memcached doesn't support wildcard searches on keys, and you will have to purge existing keys for the reason stated in #1, you would probably have to set up a separate memcached pool dedicated solely to tracking profile IDs, so you don't end up purging cached values that have nothing to do with profile view tracking. You could avoid this, however, by storing the profile ID and a timestamp within the value payload and having your batch script step through each entry, check the timestamp, and queue it for update if it falls within the time range you specified, stopping once it hits the upper limit of that range. (A rough sketch of this batch approach follows.)
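A minimal sketch of the counter-plus-cron variant, assuming pymemcache and mysql-connector-python; the key naming, credentials, and the list of profile IDs passed to the batch job are placeholders (memcached cannot enumerate keys, as noted above):

import mysql.connector
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def record_view(profile_id):
    key = "pv:%d" % profile_id
    cache.add(key, "0")         # no-op if the counter already exists
    cache.incr(key, 1)          # one more page view for this profile

def flush_to_mysql(profile_ids):
    """Run from a cron job every 15 minutes."""
    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="mydb")
    cur = conn.cursor()
    for pid in profile_ids:     # IDs supplied by the application
        key = "pv:%d" % pid
        count = cache.get(key)
        if not count or int(count) == 0:
            continue
        cache.delete(key)       # purge the entry we just read (point #1 above)
        cur.execute(
            "UPDATE Users SET page_views = page_views + %s WHERE id = %s",
            (int(count), pid),
        )
    conn.commit()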
Another option may be to parse your access logs. If user profiles are at a known location like /myapp/profile/1234, you could parse for this pattern and add profile views that way. I ended up having to go this route for advertiser tracking, as it ended up being the only repeatable way to generate billing numbers. If advertisers had any billing disputes, we would offer to send them the access logs so they could parse them themselves.