Achieve a strongly consistent view - Couchbase

I just started using Couchbase and am hoping to use it as my data store.
One of my requirements is a query that returns a certain field from every document in the store. This query is run once at server startup.
For this purpose I need all the documents that exist and can't afford to miss any of them.
I understand that views in Couchbase are eventually consistent, but I still hope this query can be done (at the cost of performance).
Notes about my configuration:
I have only one Couchbase server instance (I don't need sharding or replication)
I am using the Java client (1.4.1)
What I have tried is saving my documents this way:
client.set(key, value, PersistTo.ONE).get();
And querying using:
query.setStale(Stale.FALSE);
Adding the PersistTo parameter caused the following exception:
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: <unknown>
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:167)
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:140)
So I guess I am actually asking 3 questions:
Is it possible to get the consistent results I need?
If so, is what I suggested the correct way of doing that?
How can I prevent those exceptions?
The mapping I'm using:
function (doc, meta) {
    if (doc.doc_type && doc.doc_type == "MyType" && doc.myField) {
        emit(meta.id, null);
    }
}
Thank you

Is it possible to get the consistent results I need?
Yes, it is possible to get consistent view results by setting the stale flag to false, as you've done. However, there are performance impacts: depending on your data size the query may be slow. If you are only going to run it once at startup, it should be OK.
Couchbase is designed to be a distributed system comprising more than one node; it's not really suited to single-node deployments. I have read (but can't find the link) that view performance is much better in larger clusters.
You are also forcing a synchronous processing model onto a system that shines with asynchronous requests. PersistTo is fine for some requests, but using it system-wide on every call (personal opinion) will definitely throttle throughput and performance.
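As a hedged illustration of that point, here is a rough sketch against the 1.4.x Java client (the key/value names are made up, and the exact packages of PersistTo and OperationFuture can vary between SDK versions): fire the sets asynchronously and block only once at the end, instead of calling .get() on every single set.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.PersistTo;
import net.spy.memcached.internal.OperationFuture;

public class BulkWriter {
    // Sketch only: fire all sets first, then wait for persistence in one pass so the
    // observe polling for different keys overlaps instead of running serially.
    public static void writeAll(CouchbaseClient client, Map<String, String> docs) throws Exception {
        List<OperationFuture<Boolean>> pending = new ArrayList<OperationFuture<Boolean>>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            // exp = 0 means no expiry; PersistTo.ONE waits for persistence on the active node
            pending.add(client.set(entry.getKey(), 0, entry.getValue(), PersistTo.ONE));
        }
        for (OperationFuture<Boolean> future : pending) {
            if (!future.get()) {
                throw new IllegalStateException("Write failed for key " + future.getKey());
            }
        }
    }
}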
If so, is what I suggested the correct way of doing that?
You say the query is done once at server startup. Is this once per day or more often? If once a day, then your approach should work (though I'd consider upping the nodes ;)). If you have to run this query a lot while 'hammering' the node over and over with sets, then I'd expect to see exactly what you are currently experiencing.
How can I prevent those exceptions?
It could be a variety of reasons. What are the specs of your machine (RAM, CPU, disk)? How much RAM is allocated to Couchbase, how much to your bucket, and what percentage of the bucket's RAM is in use?
I've personally seen this when I've hammered some lower-end AWS instances over some not-so-amazing networks. What version of Couchbase are you using? It could be a whole variety of factors and probably deserves to be a separate question.
Hope that helps!
EDIT: more information on the stale=false parameter (from the official docs):
http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-views-writing-stale
The index is updated before the query is executed. This ensures that any documents updated (and persisted to disk) are included in the view. The client will wait until the index has been updated before the query has executed, and therefore the response will be delayed until the updated index is available.
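Putting the pieces together, here is a hedged sketch against the 1.4.x Java client: raise the operation and view timeouts (so durable writes and a stale=false query don't trip the default limits) and read the view once at startup. The design document and view names are placeholders, and the builder method names may differ slightly between SDK versions.

import java.net.URI;
import java.util.Arrays;
import java.util.List;

import com.couchbase.client.CouchbaseClient;
import com.couchbase.client.CouchbaseConnectionFactory;
import com.couchbase.client.CouchbaseConnectionFactoryBuilder;
import com.couchbase.client.protocol.views.Query;
import com.couchbase.client.protocol.views.Stale;
import com.couchbase.client.protocol.views.View;
import com.couchbase.client.protocol.views.ViewResponse;
import com.couchbase.client.protocol.views.ViewRow;

public class StartupLoader {
    public static void main(String[] args) throws Exception {
        List<URI> nodes = Arrays.asList(URI.create("http://127.0.0.1:8091/pools"));

        CouchbaseConnectionFactoryBuilder builder = new CouchbaseConnectionFactoryBuilder();
        builder.setOpTimeout(10000);    // give persisted writes more than the default ~2.5s
        builder.setViewTimeout(75000);  // stale=false queries can take a while on a large index

        CouchbaseConnectionFactory cf = builder.buildCouchbaseConnection(nodes, "default", "");
        CouchbaseClient client = new CouchbaseClient(cf);

        // "my_design_doc" / "my_view" are placeholders for your design document and view
        View view = client.getView("my_design_doc", "my_view");
        Query query = new Query();
        query.setStale(Stale.FALSE);    // force the index to be updated before the query runs

        ViewResponse response = client.query(view, query);
        for (ViewRow row : response) {
            System.out.println(row.getId());
        }
        client.shutdown();
    }
}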

Related

Should I implement my own caching or rely on read-replicas?

We have an enterprise application that uses an SQL database. The database access characteristics are about 90% reads. The data that does get updated or created needs to be up-to-date immediately. The cache needs to be correctly invalidated with high certainty. The entities are referred to by their primary key for 98% of the cases.
The application is based on Node.js and is AWS-native. Since the application is AWS-native, I'd like to rely on managed services from AWS rather than hosting my own. One option is to implement our own read-through Redis-based cache. Upon retrieving the entities, we'd check the cache, and if the data is not cached we'd put it into the cache before returning it to the user. The parts of the code that update those entities will invalidate the cache by primary key.
Generally speaking, in computer science cache coherency is one of the most challenging problems to get right. I am of the opinion that rather than implementing a Redis cache and thinking through all of the possible scenarios for correctly invalidating it, it is wiser to instead configure an Aurora read-replica specifically for reading frequently accessed entities. The RDBMS will do a much better job at caching than anything we can build ourselves.
So, I am facing two options -- go through the effort of implementing my own caching, or use read replicas. My personal opinion is to use a read replica.
Any advice is greatly appreciated, as always.
Yes, you're right: cache invalidation is a tough problem. The simplest solution is to add code to your data writes that replaces the cached values, so they're always current. But this is easy only if the cached values have a pretty much 1-to-1 correspondence with rows in your database.
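As a hedged sketch of that write-through invalidation idea (your stack is Node.js on AWS; Java with a Jedis 2.x-style client is used here purely to illustrate the pattern, and the repository type and key format are made up):

import redis.clients.jedis.Jedis;

public class CustomerCache {
    // Hypothetical data access layer standing in for your SQL/Aurora reads and writes
    public interface CustomerRepository {
        String loadById(long id);
        void save(long id, String value);
    }

    private final Jedis redis = new Jedis("localhost", 6379);  // placeholder endpoint
    private final CustomerRepository db;

    public CustomerCache(CustomerRepository db) {
        this.db = db;
    }

    public String findById(long id) {
        String key = "customer:" + id;
        String cached = redis.get(key);
        if (cached != null) {
            return cached;                 // cache hit
        }
        String fresh = db.loadById(id);    // cache miss: read from the primary database
        redis.setex(key, 300, fresh);      // TTL as a safety net against stale entries
        return fresh;
    }

    public void update(long id, String newValue) {
        db.save(id, newValue);             // write to the database first
        redis.del("customer:" + id);       // then invalidate by primary key
    }
}

The TTL is just a backstop; the explicit delete on every write is what keeps the cache in its roughly 1-to-1 correspondence with rows.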
An advantage of your own cache is that you can cache data that is not 1-to-1 with rows in the database. You might cache an entire HTML fragment for a drop-down menu, for example, which could be the result of several SQL queries. It can be quite an advantage to cache data that is higher up the "food chain", so to speak, but cache invalidation becomes less straightforward. This is best for storing the results of queries that don't change often.
Using a read-replica is not a substitute for using a cache. Querying a read-replica still has the overhead of making a database connection, authentication, SQL query parsing and optimization, locking, and all the other overhead that goes into the workings of an RDBMS.
Querying data from a cache can be orders of magnitude faster.
Both have their place. It's best to use both a cache and a read-replica for different tasks. I would also add message queues as an important technology. I believe database, cache, and queue form a three-legged stool.
But you must have experience and judgment to know when each is the best tool for a given case.

Can NHibernate be configured not to open a new MySQL connection for each query?

We have a web service built using ASP.NET Web API. We use NHibernate as our ORM for connecting to a MySQL database.
We have a couple of controller methods that do a large number (1,000-3,000) of relatively cheap queries.
We're looking at improving the performance of these controller methods and almost all of the time is spent doing the NHibernate queries so that's where we're focusing our attention.
In the medium term the solutions are things like reducing the number of queries (perhaps by doing fewer, larger queries) and/or parallelizing the queries (which would take some work since NHibernate does not have an async API and sessions are single threaded) and things like that.
In the short term we're looking at improving the performance without taking on either of those larger projects.
We've done some performance profiling and were surprised to find that it looks like a lot of the time in each query (over half) is spent opening the connection to MySQL.
It appears that NHibernate is opening a new connection to MySQL for each query and that MySqlConnection.Open() makes two round trips to the database each time a connection is opened (even when the connection is coming from the pool).
Here's a screenshot of one of our performance profiles where you can see these two things:
We're wondering if this is expected or if we're missing something like a misconfiguration/misuse of NHibernate or a way to eliminate the two round trips to the database in MySqlConnection.Open().
I've done some additional digging and found something interesting:
If we add .SetProperty(Environment.ReleaseConnections, "on_close") to the NHibernate configuration then Open() is no longer called and the time it takes to do the query drops by over 40%.
It seems this is not a recommended setting though: http://nhibernate.info/doc/nhibernate-reference/transactions.html#transactions-connection-release
Based on the documentation I expected to get the same behavior (no extra calls to Open()) if I wrapped the reads inside a single NHibernate transaction but I couldn’t get it to work. This is what I did in the controller method:
using (var session = _sessionFactory.OpenSession()) {
    using (var transaction = session.BeginTransaction()) {
        // controller code
        transaction.Commit();
    }
}
Any ideas on how to get the same behavior using a recommended configuration for NHibernate?
After digging into this a bit more, it turns out there was a mistake in my test implementation; after fixing it, using transactions eliminates the extra calls to Open() as expected.
Not using transactions is considered bad practice, so starting to add them is welcome anyway.
Moreover, as you seem to have found out by yourself, the default connection release mode auto currently always translates to AfterTransaction, which with NHibernate (v2 to v4 at least) releases connections after each statement when no transaction is ongoing for the session.
From Connection Release Modes:
Note that with ConnectionReleaseMode.AfterTransaction, if a session is considered to be in auto-commit mode (i.e. no transaction was started) connections will be released after every operation.
So simply transacting your session usages should do it. Since this does not seem to be the case for your application, I suspect other issues.
Is your controller code using other sessions? NHibernate explicit transactions apply only to the session from which they were started (or to sessions opened from that session with ISession.GetSession(EntityMode.Poco)).
So you need to handle a transaction for each opened session.
You may use a TransactionScope instead for wrapping many sessions in a single transaction. But each session will still open a dedicated connection. This will in most circumstances promote the transaction to distributed, which has a performance penalty and may fail if your server is not configured to enable it.
You may configure and use a contextual session instead for replacing many sessions per controller action by only one. Of course you can use dependency injection instead for achieving this too.
Notes:
About reducing the number of queries issued by an application, there are some easy to leverage features in NHibernate:
Batching of lazy-loads (batch fetching): configure your lazily loaded entities and collections of entities to not only load themselves, but also some other pending entities (of the same class) or collections of entities (the same collection on other parent entities). Add the batch-size attribute on collections and classes. I have written a detailed explanation of it in this other answer.
Second level cache, which will allow caching of data across HTTP requests. Transactions are mandatory for it to work.
Future queries, as proposed by Low.
Going parallel for a web API looks to me like a doomed road. Threads are a valuable resource for a web application: the more threads a request uses, the fewer requests the web application will be able to serve in parallel. So going that way will very likely be a major pain for your application's scalability.
The OnClose mode is not recommended because it delays connection release until session closing, which may occur quite late after the last transaction, especially when using contextual sessions. Since it looks like your session usage is very localized, likely closing very near the last query, it should not be an issue for your application.
parallelize the queries (which would take some work since NHibernate does not have an async API and sessions are single threaded) and things like that.
You can defer the execution of the queries using NHibernate Futures.
The following code (extracted from the reference article) will execute a single database round trip even though two results are retrieved:
using (var s = sf.OpenSession())
using (var tx = s.BeginTransaction())
{
    var blogs = s.CreateCriteria<Blog>()
        .SetMaxResults(30)
        .Future<Blog>();
    var countOfBlogs = s.CreateCriteria<Blog>()
        .SetProjection(Projections.Count(Projections.Id()))
        .FutureValue<int>();

    Console.WriteLine("Number of blogs: {0}", countOfBlogs.Value);

    foreach (var blog in blogs)
    {
        Console.WriteLine(blog.Title);
    }

    tx.Commit();
}
You can also use NHibernate Batching to reduce the number of queries.

Scaling up a Ruby, ActiveRecord, MySQL app

I have an app...
The app does a market comparison for a financial product - for a given quote request, it contacts several other sites for their quotes. It then gives the user the results - several quotes for their details.
To manage these requests, they get saved to MySQL and then my app kicks in, picking up the pending quotes and farming them out to threads (all on the same Linux box) to process each site lookup.
I am using JRuby as I had thread/DB-related issues, and Java thread pools to control the number of threads. With the current hardware/VPS it can handle around 200 threads. A lot of the limitations seem to relate to each thread grabbing its own MySQL connection to fetch the quote details and save back the results. We want to handle more concurrent threads and so are looking for ways to scale up.
Wondering which way to go ...
Bigger hardware...
More machines and some kind of queueing mechanism (with priorities) to share the load across the machines, so the threads don't touch the DB and all the details/responses go via the queue. The DB hit is less, but then maybe I am just pushing the problem into the queue. Thinking of using something like MongoDB for the queue, but open to suggestions; something easy to use with Ruby :)
Some kind of remote/RPC mechanism, e.g. DRb. Theoretically this seems like a good option, but I haven't done anything with it yet to know how complex it would make things.
Something else...?
From this link, Reasons for NOT scaling-up vs. -out?, it would seem this problem is suited to running more machines.
So, any thoughts on which way to go...
Cheers,
Chris
My usual approach to problems like this is to pay very close attention to the database queries you're making and tune them aggressively. Retrieve only what you need, skipping columns that aren't explicitly used, and be very careful about eager loading things you don't need in their entirety.
You'll often find you can get significant speed gains by adding indexes, or strategically de-normalizing certain attributes in your database to avoid ugly, time-consuming JOIN operations.
Further, think about caching: the fastest database call is the one that's never made. It's not hard to leverage something like Memcached to save the results of a moderately time-consuming record retrieval, and if done carefully it's even easy to invalidate and expire, provided you channel your updates through a few methods.
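A hedged sketch of that idea, shown in Java with spymemcached only to illustrate the shape (the Ruby dalli gem offers the same get/set/delete operations); loadQuoteFromMysql and the key format are placeholders:

import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class QuoteCache {
    private final MemcachedClient cache;

    public QuoteCache() throws Exception {
        cache = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
    }

    public String quoteDetails(long quoteId) {
        String key = "quote:" + quoteId;
        Object hit = cache.get(key);
        if (hit != null) {
            return (String) hit;                        // served from cache, no DB round trip
        }
        String value = loadQuoteFromMysql(quoteId);     // the moderately expensive retrieval
        cache.set(key, 600, value);                     // expire after 10 minutes
        return value;
    }

    // Channel every update through here so invalidation stays in one place
    public void quoteUpdated(long quoteId) {
        cache.delete("quote:" + quoteId);
    }

    private String loadQuoteFromMysql(long quoteId) {
        return "...";                                   // placeholder for the real DB read
    }
}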
For scheduling workers, a simple first-in, first-out queue can be implemented in Redis to off-load a lot of the processing overhead from MySQL itself. This is usually very simple to add if you follow an example.
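A minimal sketch of such a FIFO queue (Jedis again, purely for illustration; the Ruby redis gem exposes the same LPUSH/BRPOP commands, and the queue name and payload format are assumptions):

import java.util.List;

import redis.clients.jedis.Jedis;

public class QuoteQueue {
    private static final String QUEUE = "pending_quotes";

    // Producer side: called when a quote request has been saved
    public static void enqueue(Jedis redis, String quoteId) {
        redis.lpush(QUEUE, quoteId);
    }

    // Worker side: blocks up to 5 seconds waiting for the next item
    public static void workerLoop(Jedis redis) {
        while (true) {
            List<String> item = redis.brpop(5, QUEUE);  // returns [queueName, value] or null on timeout
            if (item != null) {
                processQuote(item.get(1));
            }
        }
    }

    private static void processQuote(String quoteId) {
        // placeholder: load the quote details and fan out to the external sites
    }
}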
A cache like Memcached can handle an extremely high amount of traffic, so whenever possible, cache against this to avoid hitting your database for every last thing.
If you've exhausted these options, it's time for more front-end servers and even more database capacity, but only then.
Queueing is the easiest thing for you to implement. Use something like this: http://beanstalkd.github.com/beaneater/
Basically you can prepend your methods with async., which will put them into a queue and execute them. The queue and workers can be on the same server or a different one.

MySQL and Hibernate Simultaneous read write

I have a web application which has the following parts:
Commentators continuously do match commentary through a browser-based tool. The comments are inserted into the DB using Hibernate.
Lots of users are accessing a URL to read commentary. Hibernate is reading data from the table being updated by commentators in step #1.
There are some stored procedures as well, which are set to run every hour. A few of them access the same table (used in steps #1 and #2) for reading and writing/updating purposes.
Now my problem is, whenever the site has 100+ concurrent users watching a particular match commentary, my MySQL goes down. It shows lots of queries stuck in the processlist, many of them in the "Copying to temp table" state. This makes JBoss restart frequently.
I am using transactions in Hibernate for both reading and writing purposes. Please help, because I lose big matches due to these crashes.
You have a performance problem. It is difficult to give solutions that always work. What you can consider doing is:
1) Revise the HQL (Hibernate) statements. For this, it's best to log them with <property name="show_sql">true</property> in the config file (or even a tool like log4jdbc if you want to see the actual parameters) and analyse the output. There you can see which SQL requests you issue most often. In many cases a better strategy for reading and writing DB data can significantly reduce the database traffic. Also check that you have good indexes for your tables.
2) Consider using a second level cache. (Normally Hibernate only uses the first level cache, which is of no use in your case because it is bound to one session.) Then at least the requests for reading current commentaries can be served by the cache and don't need to go to the database. (Pay attention: the cache might interfere with the stored procedures. Check whether the cache product you would like to use can coexist with your MySQL stored procedures. In the worst case you have to remove the stored procedures for the critical tables and let your application server do the job so it goes through the cache.) See the sketch after this list.
3) If only a few tables are heavily used, you can consider caching them in your application. That's more work, but perhaps you can tailor it exactly to the demands of your application, so you might be faster than with a general second level cache.
4) If nothing helps and the traffic really is too heavy, then perhaps you have to invest in more hardware.
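Regarding point 2), here is a hedged sketch of what enabling the second level cache can look like for the commentary entity; the property names are for Hibernate 4 with Ehcache and the entity is hypothetical, so adjust to your versions and mappings:

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

/*
   In hibernate.cfg.xml (Hibernate 4 + Ehcache; adjust to your setup):
   <property name="hibernate.cache.use_second_level_cache">true</property>
   <property name="hibernate.cache.use_query_cache">true</property>
   <property name="hibernate.cache.region.factory_class">org.hibernate.cache.ehcache.EhCacheRegionFactory</property>
*/
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)  // read-write: entries are invalidated when commentators update
public class Commentary {
    @Id
    private Long id;

    private String text;

    // getters and setters omitted for brevity
}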
Good luck ;-)

SQL Azure performance considerations

What performance considerations should I keep in mind when planning an SQL Azure application? Azure Storage and the worker and web roles look very scalable, but if in the end they are all using one database... it looks like the bottleneck.
I was trying to find numbers about:
How many concurrent connections does SQL Azure support?
What is the bandwidth?
But no luck.
For example, I'm planning an application that does a very high volume of inserts, but I need to return the result of an aggregate function each time (e.g. the sum of all records with the same key in a column), so I cannot go with Table Storage.
Batching is an option, but response time is critical as well, so I'm afraid the database will be flooded with lots of connections.
Sharding is another option, but even though the number of inserts is massive, the amount of data is very small: 4 to 6 columns with one PK and no FK. So even a 1 GB DB would be overkill (and an overpay :D) for a partition.
What are the key performance factors I should keep in mind when facing this kind of application?
Cheers.
Achieving both scalability and performance can be very difficult, even in the cloud. Your question was primarily about scalability, so you may want to design your application in such a way that your data becomes "eventually" consistent, using queues for example. A worker role would listen for incoming insert requests and would perform the insert asynchronously.
To minimize the number of round trips to the database and optimize connection pooling, make sure to batch your inserts as well; you could send 100 inserts in one shot. Also keep in mind that SQL Azure now supports MARS (Multiple Active Result Sets), so you can return multiple SELECTs in a single batch back to the calling code. The use of batching and MARS should reduce the number of database connections to a minimum.
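A hedged illustration of the batching idea (shown with plain JDBC; in ADO.NET, SqlBulkCopy or table-valued parameters play the same role, and the table and column names here are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInserter {
    // Sends all rows to the server in one batched round trip instead of one per insert
    public static void insertBatch(String connectionString, List<String[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(connectionString);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO measurements (key_col, value_col) VALUES (?, ?)")) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();  // one round trip for the whole batch
        }
    }
}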
Sharding usually helps for Read operations; not so much for inserts (although I never benchmarked inserts with sharding). So I am not sure sharding will help you that much for your requirements.
Remember that the Azure offering is designed first for scalability and reasonable performance in a multitenancy environment, where your database is shared with others on the same server. So if you need strong performance with guaranteed response time you may need to reevaluate your hosting choices or indeed test the performance boundaries of Azure for your needs as suggested by tijmenvdk.
SQL Azure will throttle your connections if any form of resource contention occurs (this includes heavy load but might also occur when your database is physically moved around). Throttling is non-deterministic, meaning that you cannot predict if and when this happens. When throttling, SQL Azure will drop your connection, requiring you to perform a retry. Number of connections supported and bandwidth is not published "by design" due to the flexible nature of the underlying infrastructure. Having said that, the setup is optimized for high availability, not high throughput.
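Since throttling shows up as dropped connections that you must retry, a minimal, generic sketch of the retry-with-backoff pattern could look like this (the attempt count and delays are arbitrary, and a real implementation should inspect the SQL Azure throttling error codes rather than retrying every SQLException):

import java.sql.SQLException;
import java.util.concurrent.Callable;

public class RetryHelper {
    // Runs the given unit of work, retrying with exponential backoff when the
    // connection is dropped; gives up after maxAttempts tries.
    public static <T> T withRetry(Callable<T> work, int maxAttempts) throws Exception {
        long delayMs = 500;
        for (int attempt = 1; ; attempt++) {
            try {
                return work.call();
            } catch (SQLException e) {
                if (attempt >= maxAttempts) {
                    throw e;               // out of attempts, surface the error
                }
                Thread.sleep(delayMs);     // back off before retrying
                delayMs *= 2;
            }
        }
    }
}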
If the bursts happen at a known time, you might consider sharding just during those bursts and consolidating the data after the burst has happened. Another way to handle this is to start queueing/batching writes if and only if throttling occurs. You can use an Azure Queue for that, plus a worker role to empty the queue later. This "overflow mechanism" has the advantage of automatically engaging if throttling occurs.
As an alternative you could use Azure Table Storage and keep a separate table of running totals that you can report back instead of performing an aggregation over the data to return the required sum of all records (this might be tricky due to the lack of locking on the tables though).
Apologies for stating the obvious, but the first step would be to test if you run into throttling at all in your scenario. I would give the overflow solution a try.