I am designing a high-volume service that essentially works as a token creation and validation service. Today we use SQL-based databases, which are fast failing to scale, and the alternative we are looking at is Couchbase Server (memcached). However, the use case is that the token generated by this service is sent to other services, where it will be used for authentication. If the replication is not fast enough, the authentication fails. Is there a simpler way to achieve this via code? Any other alternatives are also welcome. This seems to us to be a "read-your-own-write" use case.
There is no consistency API for XDCR right now. It's probably somewhere on the dev roadmap, because it's one of the commonly requested features.
If you want to get RYOW consistency across data centers, your only option is to write to all the DCs simultaneously from the application code. Of course, it's not atomic, but you can work around that somewhat by waiting for all the clusters to acknowledge the write before proceeding.
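A minimal sketch of that workaround, assuming one Couchbase collection handle per data center from the Python SDK (the handles, key, and timeout are placeholders):

from concurrent.futures import ThreadPoolExecutor

def store_token(collections, key, token, timeout=2.0):
    # 'collections' holds one Couchbase collection handle per DC;
    # issue the same write everywhere, then block until every DC acks
    with ThreadPoolExecutor(max_workers=len(collections)) as pool:
        futures = [pool.submit(c.upsert, key, token) for c in collections]
        for f in futures:
            f.result(timeout=timeout)  # re-raises if any DC write failed

If any cluster fails or times out, you still have to decide whether to fail the token creation or retry, since the multi-DC write is not atomic.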
I think you might be misunderstanding what replication is for in Couchbase. Couchbase is strongly consistent for standard data operations. If you write a document or key/value pair, you can immediately read that object. No waiting for replication. The use case you are talking about is a very common one for Couchbase.
The only things that are eventually consistent in Couchbase are XDCR, for obvious reasons, and reading from views. Even then, this can be minimized with proper cluster sizing.
I am working on a project that uses Couchbase Server and Sync Gateway to synchronize the contents of a bucket with iOS and Android clients running Couchbase Lite. I also need read and write access to the Couchbase Server from a Node.js server application.

From the research I've done, using shadowing is frowned upon (https://github.com/couchbase/sync_gateway/wiki/Bucket-Shadowing), which led me to look into the Sync Gateway API as a means to update the bucket from the Node.js application.

Updating existing documents through the Sync Gateway API appears to require the most recent revision ID of the document to be passed in, requiring a separate read before the modification (http://mobile-couchbase.narkive.com/HT2kvBP0/cblite-sync-gateway-couchbase-server), which seems potentially inefficient. What is the best way to solve this problem?
Updating a document (which is really creating a new revision) requires the revision ID. Otherwise Couchbase can't associate the update with a parent. This breaks the whole approach to conflict resolution. (Couchbase uses a method known as multiversion concurrency control.)
The expectation is that you're updating the existing contents of a document. This implies you've read the document already, including the revision ID.
If for some reason you don't need the old contents to update the document, you still need the revision ID. If you work around it (for example, by purging a document through Sync Gateway and then pushing your new version) you can end up with two versions of the document in the system with no connection, which will cause a special kind of conflict.
So the short answer is no, there's no way to avoid this (without causing yourself other headaches).
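For illustration, the read-then-update cycle against the Sync Gateway REST API looks roughly like this (a Python sketch using the requests package; the host, database name, and document ID are placeholders, and a Node.js client would issue the same two HTTP calls):

import requests

base = 'http://sync-gateway-host:4984/mydb'  # placeholder host and database

doc = requests.get(base + '/mydoc').json()   # read, to learn the current _rev
doc['status'] = 'updated'
resp = requests.put(base + '/mydoc',
                    params={'rev': doc['_rev']},  # parent revision is required
                    json=doc)
resp.raise_for_status()

The extra GET is usually cheap compared to the conflict headaches described above.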
I am not sure why your question was downvoted, as it seems like a reasonable question. You are correct: the Couchbase bucket that is used by Sync Gateway is probably best thought of as "opaque"; you should not be poking around in there and changing things. There are a number of implementations of Couchbase Lite, including ones for Java, .NET, and Mac OS X. Have you considered making a web service that, on one side, serves your application, and on the other side is itself a Couchbase Lite client? You should be able to separate your data as necessary using channels.
I just started using Couchbase and am hoping to use it as my data store.
One of my requirements is performing a query that will return a certain field from all the documents in the store. This query is done once at server startup.
For this purpose I need all the documents that exist and can't miss any of them.
I understand that views in Couchbase are eventually consistent, but I still hope this query can be done (at the cost of performance).
Notes about my configuration:
I have only one Couchbase Server instance (I don't need sharding or replication)
I am using the Java client (1.4.1)
What I have tried to do is save my documents this way:
client.set(key, value, PersistTo.ONE).get();
And querying using:
query.setStale(Stale.FALSE);
Adding the PersistTo parameter caused the following exception:
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: <unknown>
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:167)
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:140)
So I guess I am actually asking 3 questions:
Is it possible to get the consistent results I need?
If so, is what I suggested the correct way of doing that?
How can I prevent those exceptions?
The map function I'm using:
function (doc, meta) {
  if (doc.doc_type && doc.doc_type == "MyType" && doc.myField) {
    emit(meta.id, null);
  }
}
Thank you
Is it possible to get the consistent results I need?
Yes, it is possible to make a Couchbase view query consistent by setting the stale flag to false, as you've done. However, there are performance impacts with this, so depending on your data size the query may be slow. If you are only going to do it once a day then it should be OK.
Couchbase is designed to be a distributed system comprising more than one node; it's not really suited to single-node deployments. I have read (but can't find the link) that view performance is much better in larger clusters.
You are also forcing more of a synchronous processing model onto a system that shines with asynchronous requests. PersistTo is OK to use for some requests, but not system-wide on every call (personal opinion); it will definitely throttle throughput and performance. A sketch of the alternative follows.
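The usual pattern is to issue the writes without blocking on each one, keep the returned futures, and wait once at the end, just before the stale=false query. A Python-flavoured sketch (the 'client' handle is hypothetical, mirroring the future-returning set() of the asker's 1.4 Java client):

def load_all(client, documents):
    # issue every write up front; each set() returns a future
    futures = [client.set(key, value) for key, value in documents.items()]
    # block once, at the end, instead of after every single call
    for f in futures:
        f.get()

The total wall-clock time is then dominated by the slowest write, not the sum of all of them.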
If so, is what I suggested the correct way of doing that?
You say the query is done once your application server is running. Is this once per day or more often? If once a day, then your application should work (I'd consider upping the nodes ;)). If you have to do this query a lot, and you are 'hammering' the node over and over with sets, then I'd expect to see what you are currently experiencing.
How can I prevent those exceptions?
It could be a variety of reasons. What are the specs of your machine: RAM, CPU, disk? How much RAM is allocated to Couchbase, how much to your bucket, and what percentage of the bucket's RAM is in use?
I've personally seen this when I've hammered some lower-end AWS instances on some not-so-amazing networks. What version of Couchbase are you using? It could be a whole variety of factors, and it deserves to be a separate question.
Hope that helps!
EDIT: more information on the stale=false parameter (from the official docs):
http://docs.couchbase.com/couchbase-manual-2.2/#couchbase-views-writing-stale
The index is updated before the query is executed. This ensures that any documents updated (and persisted to disk) are included in the view. The client will wait until the index has been updated before the query is executed, and therefore the response will be delayed until the updated index is available.
All software is Windows-based, coded in Delphi.
Some guys submit some data, which I send by TCP to a database server running MySQL.
Some other guys add a pass/fail to their data and update the database.
And a third group are just looking at reports.
Now, the first group can see a history of what they submitted. When the second group adds pass/fail, I would like to update their history. My options seem to be
blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient.
ask the database server regularly if anything changed in the last X minutes.
never poll the database server, instead letting it inform the user's app when something changes.
Option 1 seems inefficient. Option 2 seems better. Option 3 reduces TCP traffic, but not by much; option 2 only costs a few bytes per poll anyway. However, option 3 has the disadvantage that both sides are now both TCP client and server.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?
I guess there are two things to consider: most importantly, reducing network traffic, and, less importantly, keeping my code simple.
I am sure this is a very common pattern, but I am new to this kind of thing, so would welcome advice. Thanks in advance.
[Update] Close voters, I have googled and can't find an answer. I am hoping for the benefit of your experience. Can you help me reword this to be acceptable? Or maybe give a URL which will help me? Thanks
Short answer: use notifications (option 3).
Long answer: this is a use case for some middle layer which propagates changes using a message-oriented middleware. This decouples the messaging logic from database metadata (triggers / stored procedures), can use peer-to-peer and publish/subscribe communication patterns, and more.
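A minimal sketch of the pattern, assuming a RabbitMQ broker and Python's pika package (exchange and routing-key names are invented; the Delphi clients would follow the same flow through a Delphi STOMP/AMQP client):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('broker-host'))
ch = conn.channel()
ch.exchange_declare(exchange='data.changes', exchange_type='topic')

# Writer side: announce that a pass/fail result was recorded
ch.basic_publish(exchange='data.changes',
                 routing_key='results.updated',
                 body='record 42 changed')  # payload format is up to you

# Reader side: bind a private queue and react to announcements
q = ch.queue_declare(queue='', exclusive=True).method.queue
ch.queue_bind(queue=q, exchange='data.changes', routing_key='results.*')
ch.basic_consume(queue=q,
                 on_message_callback=lambda c, m, p, body: print(body),
                 auto_ack=True)
ch.start_consuming()

The database never needs to know who is listening; viewers come and go simply by binding and unbinding queues.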
I have blogged a two-part article about this at
Firebird Database Events and Message-oriented Middleware (part 1)
Firebird Database Events and Message-oriented Middleware (part 2)
The article is about Firebird but the suggested solutions can be applied to any application / database.
In your scenario, clients can also use the middleware message broker to send messages to the system even if the database or the Delphi part is down. The messages will be queued in the broker until the other parts of the system are back online. This is an advantage if there are many clients and update installations or maintenance windows are required.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?
If this is a real requirement (reports are usually an immutable 'snapshot' of data, but maybe you mean a view which needs to be updated while being watched, similar to a stock ticker), then it is easy to implement: a client just needs to 'subscribe' to an information channel which announces relevant data changes. This can be solved very flexibly and resource-efficiently with existing message broker features like message selectors and destination wildcards. (Note that I am the author of some Delphi and Free Pascal client libraries for open source message brokers.)
Related questions:
Client-Server database application: how to notify clients that data was changed?
How to communicate within this system?
Each of your proposed solutions is viable in certain situations.
I've been writing software for a long time and comments below relate to personal experience which dates way back to 1981. I have no doubt others will have alternative opinions which will also answer your questions.
Please allow me to justify the positives and negatives of each approach, and the parameters around each comment.
"blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient."
Yes, this is inefficient
It is often the quickest and simplest thing to do.
Seems like the best short-term temporary solution which gives maximum value for minimal effort.
Good for "exploratory coding" helping derive a better software design.
Should be a good basis to refine / explore alternatives.
It's very important for programmers to strive to document, and/or share with team members who could be affected, when a tech-debt-inducing fix has been checked in.
If not intended as production quality code, this is acceptable.
If usability is poor, then consider more efficient solutions, like what you've described below.
"ask the database server regularly if anything changed in the last X minutes."
You are talking about a "pull" or "polling" model. Consider the following API options for this model:
What's changed since the last time I called you? (The client provides the timestamp, to avoid the service having to store and retrieve session state.)
If nothing has changed, the server can provide a time when the client should poll again. A system under excessive load is then able to back off its clients: if the server application is aware of such conditions, it can control the polling rate of compliant clients by instructing them to wait longer before retrying. (A sketch of the client side follows this list.)
After considering that, ask "Is the API as simple as it can possibly be?"
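For concreteness, the client side of such a contract might look like this (the endpoint and the 'since'/'retry_after' field names are invented; any HTTP+JSON transport works the same way):

import time
import requests

def poll_forever(url, handle):
    since = 0  # the client keeps its own cursor, so the server stays stateless
    while True:
        resp = requests.get(url, params={'since': since}).json()
        for change in resp.get('changes', []):
            handle(change)
        since = resp.get('now', since)
        # honouring a server-suggested delay lets a loaded server back clients off
        time.sleep(resp.get('retry_after', 60))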
"never poll the database server, instead letting it inform the user's app when something changes."
This is the "push" model you're talking about: publishing changes, ready for subscribers to act upon.
Consider what impact this has on clients waiting for a push: timeout scenarios, number of clients, system resource consumption, and so on.
Consider that the "pusher" has to become aware of all consuming applications. If you use an industry-standard message queueing system (RabbitMQ, MSMQ, MQ Series, etc., all naturally supporting publish/subscribe via JMS topics or an equivalent), then this problem is abstracted away, though it also adds some complexity to your application.
Consider the scenarios where clients suddenly become unavailable: hypothesize failure modes and test the robustness of your system, so you have confidence that it is able to recover properly from failure and consistently remain stable.
So, what do you think the right approach is now?
I have a python application where I want to start doing more work in the background so that it will scale better as it gets busier. In the past I have used Celery for doing normal background tasks, and this has worked well.
The only difference between this application and the others I have done in the past is that I need to guarantee that these messages are processed; they can't be lost.
For this application I'm not too concerned about the speed of my message queue; I need reliability and durability first and foremost. To be safe I want to have two queue servers, both in different data centers in case something goes wrong, one a backup of the other.
Looking at Celery, it looks like it supports a bunch of different backends, some with more features than others. The two most popular look like Redis and RabbitMQ, so I took some time to examine them further.
RabbitMQ:
Supports durable queues and clustering, but the problem with the way they do clustering today is that if you lose a node in the cluster, all messages on that node are unavailable until you bring the node back online. It doesn't replicate the messages between the different nodes in the cluster; it just replicates the metadata about the message and then goes back to the originating node to fetch it. If that node isn't running, you are S.O.L. Not ideal.
The way they recommend to get around this is to set up a second server, replicate the file system using DRBD, and then run something like Pacemaker to switch clients to the backup server when needed. This seems pretty complicated; I'm not sure if there is a better way. Anyone know of one?
Redis:
Supports a read slave, which would allow me to have a backup in case of emergencies, but it doesn't support a master-master setup, and I'm not sure if it handles active failover between master and slave. It doesn't have the same features as RabbitMQ, but looks much easier to set up and maintain.
Questions:
What is the best way to set up Celery so that it will guarantee message processing?
Has anyone done this before? If so, would you mind sharing what you did?
A lot has changed since the OP! There is now an option for high-availability aka "mirrored" queues. This goes pretty far toward solving the problem you described. See http://www.rabbitmq.com/ha.html.
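For example, once a mirroring policy is in place on the broker (on RabbitMQ 3.x, something like rabbitmqctl set_policy ha-all '.*' '{"ha-mode":"all"}'), the Celery side mainly needs settings like the following sketch. The names follow Celery 3.x and should be checked against your version; the broker URL is a placeholder:

# celeryconfig.py -- reliability-oriented settings (Celery 3.x names)
BROKER_URL = 'amqp://user:pass@rabbit-host//'  # placeholder credentials
CELERY_ACKS_LATE = True            # acknowledge only after the task has run
CELERY_TASK_PUBLISH_RETRY = True   # retry publishing if the connection drops

Celery's queues are durable by default, so with mirrored queues and late acks a single node failure should no longer lose messages.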
You might want to check out IronMQ; it covers your requirements (durable, highly available, etc.) and is a cloud-native solution, so there's zero maintenance. And there's a Celery broker for it, https://github.com/iron-io/iron_celery, so you can start using it just by changing your Celery config.
I suspect that Celery bound to existing backends is the wrong solution for the reliability guarantees you need.
Given that you want a distributed queueing system with strong durability and reliability guarantees, I'd start by looking for such a system (they do exist) and then figuring out the best way to bind to it in Python. That may be via Celery & a new backend, or not.
I've used Amazon SQS for this purpose and got good results. You will keep receiving a message until you delete it from the queue, and it lets you grow your app as much as you need.
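That redelivery-until-deleted behaviour is what gives the at-least-once guarantee. A sketch with today's boto3 client (the queue URL is a placeholder, and process() stands in for your real handler):

import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/tokens'  # placeholder

def process(body):
    print('handling', body)  # stand-in for real work

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                           WaitTimeSeconds=20)  # long polling
for msg in resp.get('Messages', []):
    process(msg['Body'])
    # delete only after successful processing; otherwise the message
    # reappears once its visibility timeout expires
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])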
Is using a distributed rendering system an option? Normally these are reserved for HPC, but a lot of the concepts are the same. Check out Qube or Deadline Render. There are other, open-source solutions as well. All have failover in mind, given the high degree of complexity and risk of failure in some renders, which can take hours per frame of an image sequence.
What is the most efficient way of implementing queues to be read by another thread/process?
I'm thinking of using a basic MySQL table with polling on sleep. This seems to be the most scalable option (it doesn't even have to be on the same server) but might potentially result in too many queries to the DB. A sketch of what I mean follows.
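For concreteness, a sketch of that table-polling approach (the jobs table, its columns, and the credentials are invented; any DB-API driver works the same way):

import time
import MySQLdb  # or any other DB-API driver

conn = MySQLdb.connect(db='queue')  # placeholder credentials

def claim_one():
    # FOR UPDATE locks the row, so two workers can never claim the same job
    cur = conn.cursor()
    cur.execute("SELECT id, payload FROM jobs WHERE state = 'ready' "
                "ORDER BY id LIMIT 1 FOR UPDATE")
    row = cur.fetchone()
    if row:
        cur.execute("UPDATE jobs SET state = 'taken' WHERE id = %s", (row[0],))
    conn.commit()
    return row

while True:
    job = claim_one()
    if job is None:
        time.sleep(1)   # the polling cost the question worries about
    else:
        print('processing', job[1])  # stand-in for real work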
You have several options, and it really depends on what you are trying to get the system to do.
Fork child processes and communicate with them through their stdin/stdout pipes.
Create a named pipe on the file system, like /tmp/mysql.sock. This is basically using sockets to communicate across processes.
Set up a message broker. I'd recommend giving ActiveMQ a try, using the Stomp client for Perl (see the sketch below). This is probably your most scalable solution.
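The broker option in miniature: the answer suggests Perl's Stomp client; the same flow in Python with the stomp.py package (4.x-style API; the host, credentials, and queue name are invented):

import stomp

conn = stomp.Connection([('activemq-host', 61613)])
conn.connect('user', 'pass', wait=True)                   # placeholder credentials
conn.send(destination='/queue/work', body='job payload')  # producer side
conn.disconnect()

A consumer subscribes to the same destination and the broker pushes messages to it, so no polling is involved.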
This is one of those things that is simple to write yourself to your exact specifications. I wrote a toy one here:
http://github.com/jrockway/app-queue
I am not sure it compiles anymore, as AnyEvent::Subprocess has changed significantly since I wrote it. But you can steal the ideas.
Basically, I think an RPC-style infrastructure is the best. You have a server that handles keeping the data. Then clients connect and add data or remove data via RPC calls. This gives you ultimate flexibility with the semantics. You can be "transactional" so that if a client takes data and then never says "hey, I am done with it", you can assume the client died and give the job to another client. You can also ensure that each job is only run once.
Anyway, making a queue work with a relational database table involves a bit of effort. You should use something like KiokuDB for the persistence. (You can physically store the data in MySQL if you desire, but KiokuDB provides a nicer Perl API on top.)
In PostgreSQL you could use the NOTIFY/LISTEN combination; after issuing LISTEN, the reader only needs to wait on the PG connection's socket. A sketch follows.
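A minimal psycopg2 version of that idea (the DSN and channel name are placeholders):

import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect('dbname=app')  # placeholder DSN
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute('LISTEN queue_changed;')

while True:
    # block on the connection's socket until a notification (or timeout) arrives
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timed out; loop and wait again
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        print('got notification:', note.channel, note.payload)

The producer side is just a NOTIFY queue_changed (optionally with a payload) issued in the same transaction that inserts the work item.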