How to store json-patch operations in redis queue and guarantee their consistency? - json

I have collaborative web application that handles JSON-objects like the following:
var post = {
id: 123,
title: 'Sterling Archer',
comments: [
{text: 'Comment text', tags: ['tag1', 'tag2', 'tag3']},
{text: 'Comment test', tags: ['tag2', 'tag5']}
]
};
My approach is using rfc6902 (JSONPatch) specification with jsonpatch library for patching JSON document. All such documents store in MongoDB database and as you know the last one very slow for frequent writes.
To get more speed and highload application I use redis as queue for a patch operations like the following:
{ "op": "add", "path": "/comments/2", "value": {text: 'Comment test3', tags: ['tag4']}" }
I just store all such patch operations in queue and at midnight run cron script to get all patches and construct full document and update it in MongoDB database.
I don't understand yet what should I do in case corrupted patch like:
{ "op": "add", "path": "/comments/0/tags/5", "value": 'tag4'}
The patch above don't gets applied to document above because tags array has length only 3 (according official docs https://www.rfc-editor.org/rfc/rfc6902#page-5)
The specified index MUST NOT be greater than the number of elements in the array.
So when user is online he don't get any errors because his patch operations get stored in redis queue but next day he get broken document due broken patch that don't got applied in cron script.
So my question if how could I guarantee that all patches that stored in redis queue is correct and don't corrupts primary document?

As with any system that can become inconsistent, you must allow for patches to be applied as quickly as possible if you wish to catch conflicts sooner and decrease the likelihood of running into them. That is likely your main issue if you are not notifying the other clients of any updated data as soon as possible (and are just waiting for the CRON to run to update the shared data that the other clients can access).
As others have asked, it's important to understand how a "bad" patch got into the operation queue in the first place. Here are some guesses from my standpoint:
A user had applied some operations that got lost in translation. How? I don't know, but it would explain the discrepancy.
Operations are not being applied in the correct order. How? I don't know. I have no code to go off of.
Although I have no code to go off of, I can take a shot in the dark and help you analyze the latter point. The first thing we need to analyze is the different scenarios that may come up with updating a "shared" resource. It's important to note that, in any system that must eventually be consistent, we care about the:
Order of the operations.
How we will deal with conflicts.
The latter is really up to you, and you will need a good notification/messaging system to update the "truth" that clients see.
Scenario 1
User A applies operations 1 & 2. The document is updated on the server and then User B is notified of this. User B was going to apply operations 3 & 4, but these operations (in this order) do not conflict with operations 1 & 2. All is well in the world. This is a good situation.
Scenario 2
User A applies operations 1 & 2. User B applies operations 3 & 4.
If you apply the operations atomically per user, you can get the following queues:
[1,2,3,4] [3,4,1,2]
Anywhere along the line, if there is a conflict, you must notify either User A or User B based on "who got there first" (or any other weighting semantics you wish to use). Again, how you deal with conflicts is up to you. If you have not read up on vector clocks, you should do so.
If you don't apply operations atomically per user, you can get the following queues:
[1,2,3,4] [3,4,1,2] [1,3,2,4] [3,1,4,2] [3,1,2,4] [1,3,4,2]
As you can see, forgoing atomic updates per user increases the combinations of updates and will therefore increase the likelihood of a collision happening. I urge you to ensure that operations are being added to the queue atomically per user.
A Recap
Some important things you should remember:
Make sure updates to the queue are atomically applied per user.
Figure out how you will deal with several versions of a shared resource arising from multiple mutations from different clients (again I suggest you read up on vector clocks).
Don't update a shared resource that may be accessed by several clients in real-time as a cron job.
When there is a conflict that cannot be resolved, figure out how you will deal with it.
As a result of point 3, you will need to come up with a notification system so that clients can get updated resources quickly. As a result of point 4, you may choose to include telling clients that something went wrong with their update. Something that has just come to the top of my head is that you're already using Redis, which has pub/sub capabilities.
EDIT:
It seems like Google Docs handles conflict resolutions with transformations. That is, by shifting whole characters/lines over to make way for a hybrid application of all operations: https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs_22.html
As I had said before, it's all up to how you want to handle your own conflicts, which should largely be determined by the application/product itself and its use cases.

IMHO you are introducing unneeded complexity instead of simpler solution. These would be my alternate suggestions instead of your approach of a json patch cron which is very hard to make consistent and atomic.
Use mongodb only : With proper database design and indexing, and proper hdarware allocation/sharding, the write performance of mongodb is really fast. And the kind of operations you are using in jsonpatch are natively supported in mongodb BSON documents and their query language .e.g $push,$set,$inc,$pull etc.
Perhaps you want to not interrupt users activities with a syncronous write to Mongodb , for that the solution is using async queus as mentioned in point#2.
2.Use task queues & mongodb: Instead of storing patches in redis like you do now, you can push the patching task to a task queue, which will asyncronously do the mongodb update , and user will not experience any slow performance. One very good task queue is Celery , which can use Redis as a broker & messaging backend. So, each users updates get a single task, and will get applied to mongodb by the task queue, and there will be no performance hit.

Related

The point of Stryket.Net since.ignore-changes-in config

Stryker-net has an option since.ignore-changes-in and I'm trying to understand in which use case it may be useful to ignore non C# files 1.
The doc gives that example of value ['/*Assets.json','/favicon.ico'], but if my last commit changes only Assets.json and if I run stryker with "since": {"target": "<sha1 of my previous commit>"}, then no mutation will be found on this change, with or without this ignore-changes-in option, right?
What am I missing?
1: Sure, if I don't find it useful I could just avoid using it on my personal projects. However I'm responsible for providing stryker in CI for several teams in my company and I'm hence interested in making sure I have a good understanding of the consequences of using (or not) this.
I can think of two cases.
First, it might bring some performance gain (just by reducing the amount of files Stryker will get to), e.g. when your change updates 100 XML localizations of some string.
Second, you might have real code you don't want to run Stryker against. E.g. you have some Obsolete folder with a bunch of projects you don't care about but which automatically updates with some renaming and so on. This might be interchangeable with the mutate option, unless you want to have different filters for "normal" and "diff" Stryker executions.

Replacing items in message queue

Our system requirements say that we need to build a slightly unusual producer-consumer processing system. Imagine we have multiple data streams and we take a snapshot each X seconds and put it into the queue for processing. The streams count is not constant. The more clients we have, the more streams we need to process. At the same time, we don't need to process ALL taken snapshots. If we have too many clients and we are not able to process all items in real-time, we would prefer to skip old snapshots and process only the latest ones.
So as I see, the requirements can be met by keeping only one item in a queue for each stream. If there is a new snapshot, while the previous is still there, we need to REPLACE it using stream id as a key.
Is it possible to implement such behavior by Service Bus queue or something similar? Or maybe it makes sense to look into some other solutions like Redis?
So as I see, the requirements can be met by keeping only one item in a
queue for each stream. If there is a new snapshot, while the previous
is still there, we need to REPLACE it using stream id as a key. Is it
possible to implement such behavior by Service Bus queue or something
similar?
To the best of my knowledge, Azure Service Bus does not support this scenario. Through it's duplicate detection functionality, in fact it supports exact opposite of that. You would need to use some other mechanism (like Redis Cache you mentioned) to accomplish this.

Dealing with exceptions in an event driven world

I'm trying to understand how exceptions are handled in an event driven world using micro-services (using apache kafka). For example, if you take the following order scenario whereby the following actions need to happen before the order can be completed.
1) Authorise the payment with the payment service provider
2) Reserve the item from stock
3.1) Capture the payment with the payment service provider
3.2) Order the item
4) Send a email notification accepting the order with a receipt
At any stage in this scenario, there could be a failure such as:
The item is no longer in stock
The payment information was incorrect
The account the payee is using doesn't have the funds available
External calls such as those to the payment service provider fail, such as downtime
How do you track that each stage has been called for and/or completed?
How do you deal with issues that arise? How would you notify the frontend of the failure?
Some of the things you describe are not errors or exceptions, but alternative flows that you should consider in your distributed architecture.
For example, that an item is out of stock is a perfectly valid alternative flow in your business process. One that possibly requires human intervention. You could move the message to a separate queue and provide some UI where a human operator can deal with the problem, solve it and cause the flow of events to continue.
A similar thing could be said of the payment problems you describe. If an order cannot successfully be settled, a human operator will need to investigate the case and solve it. For that matter, your design must contemplate that alternative flow as part of it, and make it so a human can intervene somehow when the messages end up in a queue that requires a person to review them.
Those cases should be differentiated from errors or exceptions being thrown by the program. Those cases, depending on the circumstance, might in fact require to move the message to a dead letter queue (DLQ) for an engineer to take a look at them.
This is a very broad topic and entire books could written about this.
I believe you could probably benefit from gaining more understanding of concepts like:
Compensating Transactions Pattern
Try/Cancel/Confirm Pattern
Long Running Transactions
Sagas
The idea behind compensating transactions is that every ying has its yang: if you have one transaction that can place an order, then you could undo that with a transaction that cancels that order. This latter transaction is a compensating transaction. So, if you carry out a number of successful transactions and then one of them fails, you can trace back your steps and compensate every successful transaction you did and, as a result, revert their side effects.
I particularly liked a chapter in the book REST from Research to Practice. Its chapter 23 (Towards Distributed Atomic Transactions over RESTful Services) goes deep in explaining the Try/Cancel/Confirm pattern.
In general terms it implies that when you do a group of transactions, their side effects are not effective until a transaction coordinator gets a confirmation that they all were successful. For example, if you make a reservation in Expedia and your flight has two legs with different airlines, then one transaction would reserve a flight with American Airlines and another one would reserve a flight with United Airlines. If your second reservation fails, then you want to compensate the first one. But not only that, you want to avoid that the first reservation is effective until you have been able to confirm both. So, initial transaction makes the reservation but keeps its side effects pending to confirm. And the second reservation would do the same. Once the transaction coordinator knows everything is reserved, it can send a confirmation message to all parties such that they confirm their reservations. If reservations are not confirmed within a sensible time window, they are automatically reversed by the affected system.
The book Enterprise Integration Patterns has some basic ideas on how to implement this kind of event coordination (e.g. see process manager pattern and compare with routing slip pattern which are similar ideas to orchestration vs choreography in the Microservices world).
As you can see, being able to compensate transactions might be complicated depending on how complex is your distributed workflow. The process manager may need to keep track of the state of every step and know when the whole thing needs to be undone. This is pretty much that idea of Sagas in the Microservices world.
The book Microservices Patterns has an entire chapter called Managing Transactions with Sagas that delves in detail on how to implement this type of solution.
A few other aspects I also typically consider are the following:
Idempotency
I believe that a key to a successful implementation of your service transactions in a distributed system consists in making them idempotent. Once you can guarantee a given service is idempotent, then you can safely retry it without worrying about causing additional side effects. However, just retrying a failed transaction won't solve your problems.
Transient vs Persistent Errors
When it comes to retrying a service transaction, you shouldn't just retry because it failed. You must first know why it failed and depending on the error it might make sense to retry or not. Some types of errors are transient, for example, if one transaction fails due to a query timeout, that's probably fine to retry and most likely it will succeed the second time; but if you get a database constraint violation error (e.g. because a DBA added a check constraint to a field), then there is no point in retrying that transaction: no matter how many times you try it will fail.
Embrace Error as an Alternative Flow
As mentioned at the beginning of my answer, not everything is an error. Some things are just alternative flows.
In those cases of interservice communication (computer-to-computer interactions) , when a given step of your workflow fails, you don't necessarily need to undo everything you did in previous steps. You can just embrace error as part of you workflow. Catalog the possible causes of error and make them an alternative flow of events that simply requires human intervention. It is just another step in the full orchestration that requires a person to intervene to make a decision, resolve an inconsistency with the data or just approve which way to go.
For example, maybe when you're processing an order, the payment service fails because you don't have enough funds. So, there is no point in undoing everything else. All we need is to put the order in a state that some problem solver can address it in the system and, once fixed, you can continue with the rest of the workflow.
Transaction and Data Model State are Key
I have discovered that this type of transactional workflows require a good design of the different states your model has to go through. As in the case of Try/Cancel/Confirm pattern, this implies initially applying the side effects without necessarily making the data model available to the users.
For example, when you place an order, maybe you add it to the database in a "Pending" status that will not appear in the UI of the warehouse systems. Once payments have been confirmed the order will then appear in the UI such that a user can finally process its shipments.
The difficulty here is discovering how to design transaction granularity in way that even if one step of your transaction workflow fails, the system remains in a valid state from which you can resume once the cause of the failure is corrected.
Designing for Distributed Transactional Workflows
So, as you can see, designing a distributed system that works in this way is a bit more complicated than individually invoking distributed transactional services. Now every service invocation may fail for a number of reasons and leave your distributed workflow in a inconsistent state. And retrying the transaction may not always solve the problem. And your data needs to be modeled like a state machine, such that side effects are applied but not confirmed until the entire orchestration is successful.
That‘s why the whole thing may need to be designed in a different way than you would typically do in a monolithic client–server application. Your users may now be part of the designed solution when it comes to solving conflicts, and contemplate that transactional orchestrations could potentially take hours or even days to complete depending on how their conflicts are resolved.
As I was originally saying, the topic is way too broad and it would require a more specific question to discuss, perhaps, just one or two of these aspects in detail.
At any rate, I hope this somehow helped you with your investigation.

What are examples of real-world scenarios where a message queuing system can accept the loss of some messages?

I was reading this blog post, in which the author proposes the following question, in the context of message queues:
does it matter if a message is lost? If you application node, processing the request, dies, can you recover? You’ll be surprised how often it doesn’t actually matter, and you can function properly without guaranteeing all messages are processed
At first I thought that the main point of handling messages was to never loose a single message - after all, a message lost could mean a hotel reservation not booked, a checkout not completed, or any other functionality not carried through, which seems too similar to a bug for me. I suppose I am missing something, so, what are examples of scenarios where it is OK for a messaging system to loose a few messages?
Well, your initial expectation:
the main point of handling messageswas to never loose a single message
was just not a correct one.
Right, if one strives for a one certain type of robustness, where fail-safe measures have to take all due care and precautions, so as not a single message could get lost, yes, there your a priori expressed expectation fits.
This does not mean that all other system designs have to carry all the immense burdens and have to pay all that incurred costs ( resources-wise, latency-wise et al ), as the "100+% guaranteed delivery" systems do ( but, again, only if they can ).
Anti-pattern cases:
There are many use-cases, where an absolute certainty of delivery of each and every message originally sent is actually an anti-pattern.
Just imagine a weakly synchronised system ( including ones, that have nothing like backthrottling or even any simplest form of feedback propagation at all ), where the sensors read an actual temperature, a sound, a video-frame and send a message with that value(s).
Whenever a postprocessing system gets such information delivered, there may be a reason not to read any and all "old" values, but the most recent one(s).
If a delivery framework already got any newer set of values, all the "older" values, not processed yet, just hanging at some depth from the queue-head, yet in the queue, might create the anti-pattern, where one would not like to have to read and process any and all of those "older" values, but just the most recent one(s).
Like no one will make a trade with you based on yesterday prices, there is not positive value to make any new, current, decision based on reading any and all "old" temperature readings, that still wait in the queue.
Some smart-messaging frameworks provide explicit means for taking just the very "newest" message from a given source - thus enabling to imperatively discard any "older" messages, avoiding them from being read and processed right due to a known presence of a one "most" recent.
This answers the original question about the assumed main point of handling messages.
Efficiency first:
In any case, where a smart-delivery takes place ( either deliver an exact copy of the original message content or noting-at-all ), the resources are used at their best efforts, yet, without spending a single penny on anything but the "just-enough" smart-delivery.
Building robustness costs more than that.
Building an ultimate robustness, costs even way more than that.
Systems than do have such an extreme requirement can and may extend the resources-efficient smart-delivery so as to reach some requirements defined level of robustness, at some add-on costs.
The same but reversed is not possible -- if an "everything-proof" system is to get a slimmer form and fashion, so as to fit onto any restricted-resources hardware or to make it "forget" some "old" messages, that are of no positive value at this very moment ( but on the contrary, constitute a must for the processing element to read and process each and every "unwanted" message, just due to the fact it was delivered, while knowing a core-logic needs just the most recent one ).
Distributed systems accrue E2E-latency from many distributed sources, so any rigid-delivery system just block and penalise the only one element, who is ( latency-wise ) innocent -- the receiver.
I suppose it's OK to loose few messages from some measurement units that deliver the value once in.... Also for big data analytics solutions few lost messages won't make a big difference
It all depends on the application/larger system. The message queue is only one link in the chain, so to speak. If the application(s) at the ends are prepared to deal with loss, losing some messages is not a problem. If the application(s) rely on total messaging integrity then there will be problems.
An example of a system that will be ok with loss is weather updates for your phone. If a few temperature/wind updates don't make it to you there's no real harm in that.
Now, if you're running a nuclear reactor and you lose a few temperature updates on the core, well that is a problem.
I work a lot on safety critical, infrastructure-level systems, and am responsible for messaging much of the time. Many of those systems state clearly that messaging may reorder, duplicate, or lose messages; it's just a fact of life where distributed systems and networks are involved. The endpoint systems need to be designed to work correctly in that environment. So they track messages, ack end to end, deal with duplicates and retransmits, etc.

Message queuing solution for millions of topics

I'm thinking about system that will notify multiple consumers about events happening to a population of objects. Every subscriber should be able to subscribe to events happening to zero or more of the objects, multiple subscribers should be able to receive information about events happening to a single object.
I think that some message queuing system will be appropriate in this case but I'm not sure how to handle the fact that I'll have millions of the objects - using separate topic for every of the objects does not sound good [or is it just fine?].
Can you please suggest approach I should should take and maybe even some open source message queuing system that would be reasonable?
Few more details:
there will be thousands of subscribers [meaning not plenty of them],
subscribers will subscribe to tens or hundreds of objects each,
there will be ~5-20 million of the objects,
events themselves dont have to carry any message. just information that that object was changed is enough,
vast majority of objects will never be subscribed to,
events occur at the maximum rate of few hundreds per second,
ideally the server should run under linux, be able to integrate with the rest of the ecosystem via http long-poll [using node js? continuations under jetty?].
Thanks in advance for your feedback and sorry for somewhat vague question!
I can highly recommend RabbitMQ. I have used it in a couple of projects before and from my experience, I think it is very reliable and offers a wide range of configuraions. Basically, RabbitMQ is an open-source ( Mozilla Public License (MPL) ) message broker that implements the Advanced Message Queuing Protocol (AMQP) standard.
As documented on the RabbitMQ web-site:
RabbitMQ can potentially run on any platform that Erlang supports, from embedded systems to multi-core clusters and cloud-based servers.
... meaning that an operating system like Linux is supported.
There is a library for node.js here: https://github.com/squaremo/rabbit.js
It comes with an HTTP based API for management and monitoring of the RabbitMQ server - including a command-line tool and a browser-based user-interface as well - see: http://www.rabbitmq.com/management.html.
In the projects I have been working with, I have communicated with RabbitMQ using C# and two different wrappers, EasyNetQ and Burrow.NET. Both are excellent wrappers for RabbitMQ but I ended up being most fan of Burrow.NET as it is easier and more obvious to work with ( doesn't do a lot of magic under the hood ) and provides good flexibility to inject loggers, serializers, etc.
I have never worked with the amount of amount of objects that you are going to work with - I have worked with thousands ( not millions ). However, no matter how many objects I have been playing around with, RabbitMQ has always worked really stable and has never been the source to errors in the system.
So to sum up - RabbitMQ is simple to use and setup, supports AMQP, can be managed via HTTP and what I like the most - it's rock solid.
Break up the topics to carry specific events for e.g. "Object updated topic" "Object deleted"...So clients need to only have to subscribe to the "finite no:" of event based topics they are interested in.
Inject headers into your messages when you publish them and put intelligence into the clients to use these headers as message selectors. For eg, client knows the list of objects he is interested in - and say you identify the object by an "id" - the id can be the header, and the client will use the "id header" to determine if he is interested in the message.
Depending on whether you want, you may also want to consider ensuring guaranteed delivery to make sure that the client will receive the message even if it goes off-line and comes back later.
The options that I would recommend top of the head are ActiveMQ, RabbitMQ and Redis PUB SUB ( Havent really worked on redis pub-sub, please use your due diligance)
Finally here are some performance benchmarks for RabbitMQ and Redis
Just saw that you only have few 100 messages getting pushed out / sec, this is not a big deal for activemq, I have been using Amq on a system that processes 240 messages per second , and it just works fine. I use a thread pool of workers to asynchronously process the messages though . Look at a framework like akka if you are in the java land, if not stick with nodejs and the cool Eco system around it.
If it has to be open source i'd go for ActiveMQ, and an application server to provide the JMS functionality for topics and it has Ajax Support so you can access them from your client
So, you would use the JMS infrastructure to publish the topics for the objects, and you can create topis as you need them
Besides, by using an java application server you may be able to take advantages from clustering, load balancing and other high availability features (obviously based on the selected product)
Hope that helps!!!
Since your messages are very small might want to consider MQTT, which is designed for small devices, although it works fine on powerful devices as well. Key consideration is the low overhead - basically a 2 byte header for a small message. You probably can't use any simple or open source MQTT server, due to your volume. You probably need a heavy duty dedicated appliance like a MessageSight to handle your volume.
Some more details on your application would certainly help. Also you don't mention security at all. I assume you must have some needs in this area.
Though not sure about your work environment but here are my bits. Can you identify each object with unique ID in your system. If so, you can have a topic per each event type. for e.g. you want to track object deletion event, object updation event and so on. So you can have topic for each event type. These topics would be published with Ids of object whenever corresponding event happened to the object. This will limit the no of topics you needed.
Second part of your problem is different subscribers want to subscribe to different objects. So not all subscribers are interested in knowing events of all objects. This problem statement scoped to message selector(filtering) mechanism provided by messaging framework. So basically you need to seek on what basis a subscriber interested in particular object. Have that basis as a message filtering mechanism. It could be anything: object type, object state etc. So ultimately your system would consists of one topic for each event type with someone publishing event messages : {object-type:object-id} information. Subscribers could subscribe to any topic and with an filtering criteria.
If above solution satisfy, you can use any messaging solution: activeMQ, WMQ, RabbitMQ.