Couchbase Performance

I have Couchbase Community Edition v4, build 4047. Everything seemed great until I started issuing queries against a simple view. The view just projects the documents like so, which seems harmless:
function (doc, meta) {
    if (doc.applicationId) {
        emit(doc.applicationId, meta.id);
    }
}
I'm using the .NET client to connect and execute the query from my application, though I don't think that matters. It's a single-node configuration. I'm timing the actual HTTP requests, and the queries take anywhere from 4 seconds to over 2 minutes if I send something like 15 requests at a time through Fiddler.
I am using a stale index (Stale = Ok) to try to improve that time, but it doesn't seem to have much impact. The bucket is not very large; there are only a couple of documents in it. I've allocated 100 MB of RAM for indexing, which I'd think is fine for at least the few documents we're working with at the moment.
This is primarily local development, but we observe similar behavior when the code is promoted to our servers. The servers don't use a significant amount of RAM either, but then we aren't storing a significant number of documents; we're only talking about 10 or 20 at most, each containing around 5 primitive-type properties.
Do you have any suggestions for diagnosing this? The logs in the Couchbase admin console don't show anything unusual as far as I can tell, and this doesn't seem like normal behavior.
Update:
Here is my code to query the documents
public async Task ExpireCurrentSession(string applicationId)
{
    using (var bucket = GetSessionBucket())
    {
        var query = bucket
            .CreateQuery("activeSessions", "activeSessionsByApplicationId")
            .Key(applicationId)
            .Stale(Couchbase.Views.StaleState.Ok);

        var result = await bucket.QueryAsync<string>(query);
        foreach (var session in result.Rows)
        {
            await bucket.RemoveAsync(session.Value);
        }
    }
}

The code seems fine and should work as you expect. The 100 MB of RAM you mention allocating actually isn't for views; it only affects N1QL global secondary indexes, which brings me to the following suggestion:
You don't need to use a view for this in Couchbase 4.0; you can use N1QL to do it more simply and (probably) more efficiently.
Create a N1QL index on the applicationId field (either in code or from the cbq command-line shell) like so:
CREATE INDEX ix_applicationId ON bucketName(applicationId) USING GSI;
You can then use a simple SELECT query to get the relevant document IDs:
SELECT META(bucketName).id FROM bucketName WHERE applicationId = '123';
Or even simpler, you can just use a DELETE query to delete them directly:
DELETE FROM bucketName WHERE applicationId = '123';
Note that DML statements like DELETE are still considered a beta feature in Couchbase 4.0, so do your own risk assessment.
To run N1QL queries from .NET you use almost the same syntax as for views:
await bucket.QueryAsync<dynamic>("DELETE FROM bucketName WHERE applicationId = '123'");

Related

Is there a way that raw TypeORM queries could lead to connection pool problems?

As far as I can tell, usages of query do call release on the query runner instance they use (and there are no transactions involved). However, weirdly enough, some database calls (through TypeORM) have been getting stuck for no apparent reason, and I'm trying to rule out potential causes.
await this.myDatasource.query('SELECT * FROM users WHERE id = ?', [id]);

MySql get_lock for concurrency safe upsert

I'm writing an API in Node.js with a MySQL db and am implementing a fairly standard pattern of:
If exists then update
else insert
This of course works fine until multiple simultaneous requests are made to the API, at which point the "if exists" check on request 2 can execute before the insert of request 1, leading to two records instead of one.
I know that one way of dealing with this is to ensure that the DB has a constraint or key that prevents the duplicate record, but in this case the rules that determine whether we should insert or update are more complicated, so the check needs to be done in code.
This sounded like a good case for using a mutex/lock. I need it to be distributed, as the API may have multiple instances running as part of a pool/farm.
I've come up with the following implementation:
try {
    await this.databaseConnection.knexRaw().raw(`SELECT GET_LOCK('lock1', 10);`);
    await this.databaseConnection.knexRaw().transaction(async (trx) => {
        const existing = await this.findExisting(id);
        if (existing) {
            await this.update(myThing);
        } else {
            await this.insert(myThing);
        }
    });
} finally {
    await this.databaseConnection.knexRaw().raw(`SELECT RELEASE_LOCK('lock1');`);
}
This all seems to work fine, and my tests now produce only a single insert, although it feels a bit brute-force/manual. Being new to MySQL and Node (I come from a C# and SQL Server background), is this approach sane? Is there a better approach?
Is it sane? Subjective.
Is it technically safe? It could be -- GET_LOCK() is reliable -- but not as you have written it.
You are ignoring the return value of GET_LOCK(), which is 1 if you got the lock, 0 if the timeout expired and you didn't get the lock, and NULL in some failure cases.
As written, you'll wait 10 seconds and then do the work anyway, so it's not safe.
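For illustration, here is a minimal sketch of acquiring the lock and checking the result before doing the work. It reuses the knexRaw() wrapper from the question; the assumption that raw() resolves to a [rows, fields] pair comes from knex's behaviour with the mysql driver, so adjust the unpacking if your setup differs:

const lockResult = await this.databaseConnection
    .knexRaw()
    .raw(`SELECT GET_LOCK('lock1', 10) AS got_lock;`);
// With the mysql driver, knex's raw() resolves to [rows, fields].
const gotLock = lockResult[0][0].got_lock;
if (gotLock !== 1) {
    // 0 means we timed out waiting; NULL means an error occurred. Either way, don't proceed.
    throw new Error("Could not acquire 'lock1' within 10 seconds");
}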
This assumes you have only one MySQL master. It wouldn't work if you have multiple masters or Galera, since Galera doesn't replicate GET_LOCK() across all nodes. (A Galera cluster is a high availability MySQL/MariaDB/Percona cluster of writable masters that replicate synchronously and will survive the failure/isolation of up to (ceil(n/2) - 1) out of n total nodes).
It would be better to find and lock the relevant rows using SELECT ... FOR UPDATE, which locks the found rows (or, in some cases, the gap where they would be if they existed), blocking other transactions that attempt to acquire the same locks until you roll back or commit. If that is not practical, using GET_LOCK() is valid, subject to the point made above about the return value.
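If row locking is practical for your schema, a rough sketch of the same upsert using SELECT ... FOR UPDATE inside the existing knex transaction might look like this (the my_things table and id column are hypothetical placeholders; knex's forUpdate() is what adds the row/gap lock):

await this.databaseConnection.knexRaw().transaction(async (trx) => {
    // Locks the matching row (or the gap where it would be) until commit/rollback.
    const existing = await trx('my_things').where({ id }).forUpdate().first();
    if (existing) {
        await trx('my_things').where({ id }).update(myThing);
    } else {
        await trx('my_things').insert(myThing);
    }
});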

Async Bulk(batch) insert to MySQL(or MongoDB?) via Node.js

Straight to the question:
The problem: to do async bulk inserts (not necessarily bulk, if MySQL can handle it otherwise) using Node.js (coming from a .NET and PHP background).
Example :
Assume I have 40 (adjustable) functions doing some work (async), each adding a record to the table after its single iteration. It is very probable that more than one function makes an insert call at the same time. Can MySQL handle that directly, considering there is going to be an auto-update field?
In C# (.NET) I would have used a DataTable to hold the rows from each function, launched a thread per function, and bulk-inserted the DataTable into the database table at the end.
What approach would you suggest in this case?
Should the approach change if I need to handle 10,000 or 4 million rows per table?
Also, the DB schema is not going to change; would MongoDB be a better choice for this?
I am new to Node and NoSQL and in the noob learning phase at the moment, so if you can provide some explanation with your answer, it would be awesome.
Thanks.
EDIT :
Answer: Neither MySQL nor MongoDB has a special bulk-insert command at the protocol level; under the hood it is just a foreach loop.
Both of them are capable of handling a large number of connections simultaneously; the performance will largely depend on your requirements and production environment.
1) In MySQL, queries are executed sequentially per connection. If you are using one connection, your ~40 functions will result in 40 queries being enqueued (via an explicit queue in the mysql library, your own code, or a system queue based on synchronisation primitives), not necessarily in the same order you started the 40 functions. MySQL won't have any race-condition problems with auto-update fields in that case.
2) If you really want to execute 40 queries in parallel, you need to open 40 connections to MySQL (which is not a good idea from a performance point of view, but again, MySQL is designed to handle auto-increments correctly for multiple clients).
3) There is no special bulk-insert command in the MySQL protocol at the wire level; any library exposing a bulk-insert API is in fact just sending one long 'INSERT ... VALUES' query (see the sketch below).
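As a rough illustration of what that looks like from Node, here is a sketch using the mysql npm package, whose nested-array expansion for a VALUES ? placeholder builds exactly that kind of long multi-row INSERT (the table and column names are made up for the example):

const mysql = require('mysql');
const pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'test', connectionLimit: 10 });

// Collect the rows produced by your 40 functions, then send them in one query.
const rows = [
    [1, 'result of function 1'],
    [2, 'result of function 2']
    // ...
];
pool.query('INSERT INTO my_table (worker_id, payload) VALUES ?', [rows], (err, result) => {
    if (err) throw err;
    console.log(`Inserted ${result.affectedRows} rows with a single INSERT ... VALUES`);
});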

Document-oriented dbms as primary db and a RDBMS db as secondary db?

I'm having some performance issues with a MySQL database due to its normalization.
Most of my applications that use a database need to do some heavy nested queries, which in my case take a lot of time. Queries can take up to 2 seconds to run with indexes, and about 45 seconds without.
A solution I came across a few months back was to use a faster, more linear, document-based database, in my case Solr, as the primary database. As soon as something changed in the MySQL database, Solr was notified.
This worked really well; all queries using the Solr database only took about 3 ms.
The numbers look good, but I'm having some problems.
Huge database
The MySQL database is about 200 MB; the Solr db contains about 1.4 GB of data.
Each time I need to change a table/column, the database needs to be reindexed, which in this case took over 12 hours.
Difficult to render both a Solr object and an Active Record (MySQL) object without getting wet.
The view relies on a certain object. It doesn't care whether the object itself is an Active Record object or a Solr object, as long as it can call a set of attributes on it.
Like this.
# Controller
@song = Song.first

# View
@song.artist.urls.first.service.name
The problem in my case is that the data being returned from Solr is flat, like this:
{
    id: 123,
    song: "Waterloo",
    artist: "ABBA",
    service_name: "Groveshark",
    urls: ["url1", "url2", "url3"]
}
This forces me to build a fake active record object that can be passed to the view.
My question
Is there a better way to solve the problem?
Some kind of super duper fast primary read only database that can handle complex queries fast would be nice.
Solr individual fields update
Regarding reindexing everything on schema change: Solr does not support updating individual fields yet, but there is a JIRA issue about this that's still unresolved. However, how often do you change the schema?
MongoDB
If you can live without an RDBMS (without joins, schema, transactions, foreign key constraints), a document-based DB like MongoDB or CouchDB would be a perfect fit (here is a good comparison between them).
Why use MongoDB:
data is in native format (you can use an ORM mapper like Mongoid directly in the views, so you don't need to adapt your records as you do with Solr)
dynamic queries
very good performance on non-full text search queries
schema-less (no need for migrations)
built-in, easy-to-set-up replication
Why use Solr:
advanced, very performant full-text search
Why use MySQL:
joins, constraints, transactions
Solutions
So, the solutions (combinations) would be:
Use MongoDB + Solr
but you would still need to reindex all on schema change
Use only MongoDB
but drop support for advanced full-text search
Use MySQL in a master-slave configuration, and balance reads from the slave(s) (using a plugin like Octopus) + Solr
setup complexity
Keep current setup, denormalize data in MySQL
messy
Solr reindexing slowness
The MySQL database is about 200 MB; the Solr db contains about 1.4 GB of data. Each time I need to change a table/column, the database needs to be reindexed, which in this case took over 12 hours.
Reindexing a 200 MB DB in Solr SHOULD NOT take 12 hours! Most probably you also have other issues, like:
MySQL:
n+1 issue
indexes
SOLR:
commit after each request - this is the default setup if you use a plugin like Sunspot, but it's a performance killer in production
From http://outoftime.github.com/pivotal-sunspot-presentation.html:
By default, Sunspot::Rails commits at the end of every request that updates the Solr index. Turn that off. Use Solr's autoCommit functionality. That's configured in solr/conf/solrconfig.xml. Be glad for assumed inconsistency. Don't use search where results need to be up-to-the-second.
other setup issues (http://wiki.apache.org/solr/SolrPerformanceFactors#Indexing_Performance)
Look at the logs for more details
Instead of pushing your data into Solr to flatten the records, why don't you just create a separate table in your MySQL database that is optimized for read-only access?
Also, you seem to contradict yourself:
The view relies on a certain object. It doesn't care whether the object itself is an Active Record object or a Solr object, as long as it can call a set of attributes on it.
The problem in my case is that the data being returned from Solr is flat... This forces me to build a fake active record object that can be rendered by the view.

Expiring memcached using mysql proxy when an update occurs?

I have MySQL Proxy running, which takes a query, performs an MD5 hash on it, and caches the result in memcached. The problem occurs when an update happens in the Rails app that should invalidate that cache. Any ideas on how to invalidate all of the proper keys in the cache at that time?
The core of the problem is that you don't know what the key is, since it is MD5-generated.
However, you can mitigate the problem by not storing the full row data for that query.
Your query may look like this: "SELECT my_data.* FROM my_data WHERE conditions"
However, you can reduce the redundancy of data by using this query instead:
SELECT my_data.id FROM my_data WHERE conditions
This is then followed up by:
Memcache.mget( ids )
This won't prevent returning data that no longer matches the conditions, but it may mitigate returning stale data.
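To make the two-step pattern concrete, here is a library-agnostic sketch; db and cache are hypothetical helpers standing in for your MySQL access (through the proxy) and your memcached client, not a specific package API:

// The id-only query is the one that goes through MySQL Proxy's MD5 cache.
const ids = await db.query('SELECT my_data.id FROM my_data WHERE conditions');

// Per-record cache entries are keyed by id, so the Rails app can update or
// delete each one directly whenever that row changes.
const cached = await cache.mget(ids.map(id => `my_data:${id}`));

// Fall back to the database for any ids missing from memcached.
const missingIds = ids.filter((id, i) => cached[i] == null);
const fresh = missingIds.length
    ? await db.query(`SELECT * FROM my_data WHERE id IN (${missingIds.join(',')})`)
    : [];
const results = cached.filter(row => row != null).concat(fresh);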
--
Another option is to look into using namespaces; see here:
http://code.google.com/p/memcached/wiki/NewProgrammingTricks#Namespacing
You can namespace all of your major queries. You won't be able to delete the keys, but you can change the key version id, which will in effect expire your data.
Logistically messy, but you could use it on a few bad queries.
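A rough sketch of that version-key trick, again with a hypothetical cache client exposing get/set/incr (the key layout here is made up for illustration):

// Every cached query key for the namespace embeds the current version number.
async function namespacedKey(cache, namespace, rawKey) {
    let version = await cache.get(`${namespace}:version`);
    if (version == null) {
        version = 1;
        await cache.set(`${namespace}:version`, version);
    }
    return `${namespace}:v${version}:${rawKey}`;
}

// "Expiring" the namespace just bumps the version: old entries are never
// deleted, they simply stop being read and eventually fall out via LRU.
async function expireNamespace(cache, namespace) {
    await cache.incr(`${namespace}:version`, 1);
}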
--
Lastly, you could store those queries in a different memcached server and flush it on a more frequent basis.