How to send massive data of sensors in Orion - fiware

Suppose I have 100 sensors that each send an attribute to Orion every second. How can I manage this volume of data?
via batch operations (but I don't know whether Orion supports them)
using an edge node to aggregate the data and send it to Orion (for example, once per minute)
Thank you

Let’s consider 100 tps a high load for a given infrastructure (load and throughput must always be related to the infrastructure and the end-to-end scenario).
The main problem you may encounter is not the updates themselves: Orion Context Broker and its fork Orion-LD can handle a lot of updates. In real production scenarios, like the ones handled by Orion Context Broker and NGSI v2, the main problem is the NOTIFICATIONS triggered by those UPDATES.
If you need a 1:1 (or even a 1:2 or 1:4) ratio of UPDATES to NOTIFICATIONS, for example because you want to keep the history of every measure and also send the measures to a CEP for post-processing, then it’s not only a matter of how many updates Orion can handle, but of how many update-notifications the whole E2E chain can handle. If you have a slow notification endpoint, Orion will saturate its notification queues and you will lose notifications (so those updates never reach the historical store, the CEP, ...).
Batch updates do not help here, since the update request server is not the bottleneck and batch updates are handled internally as single updates.
To alleviate this problem I would recommend enabling the NGSI v2 flow control mechanism (only available in v2), so that the update process is automatically slowed down when the notification throughput requires it.
And of course, in any IoT scenario, if you don’t need all the data, the earlier you aggregate the better. So if your E2E doesn’t need to keep track of every single measure, data loggers are more than welcome.
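The "aggregate early" advice can be sketched as a small edge-side buffer. This is only an illustration under assumptions: the class name `EdgeAggregator`, the 60-second window and the use of a mean are hypothetical choices, and actually sending each batch to Orion is left out.

```python
from collections import defaultdict
from statistics import mean


class EdgeAggregator:
    """Buffers raw readings per sensor and emits one averaged value per window."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.window_start = None
        self.buffer = defaultdict(list)  # sensor_id -> readings in the current window

    def add(self, sensor_id, value, timestamp):
        """Store a reading; return the aggregated batch when a window closes, else None."""
        if self.window_start is None:
            self.window_start = timestamp
        batch = None
        if timestamp - self.window_start >= self.window_seconds:
            batch = self.flush()
            self.window_start = timestamp
        self.buffer[sensor_id].append(value)
        return batch

    def flush(self):
        """Return {sensor_id: mean value} for the buffered readings and clear the buffer."""
        batch = {sensor_id: mean(values) for sensor_id, values in self.buffer.items()}
        self.buffer.clear()
        return batch


aggregator = EdgeAggregator(window_seconds=60)
emitted = []
for second in range(121):  # two full minutes of one reading per second per sensor
    for sensor in ("sensor-1", "sensor-2"):
        batch = aggregator.add(sensor, float(second), timestamp=second)
        if batch is not None:
            emitted.append(batch)
print(len(emitted), "aggregated batches instead of", 121 * 2, "raw updates")
```

Each emitted batch would become a single update towards the broker, which also reduces the number of notifications it has to fan out.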

For 100 sensors sending one update per second (did I understand that correctly?) ... that's nothing. The broker can handle 2-3 thousand updates per second running in a single core and with ~4 GB of RAM (mongodb needs about 3 times that).
And, if it's more (a lot more), then yes, the NGSI-LD API defines batch operations (for Create, Update, Upsert, and Delete of entities), and Orion-LD implements them all.
However, there's no batch op for attribute update. You'd need to use "batch update entity", the update mode (not replace). Check the NGSI-LD API spec for details.
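As a rough sketch of what a batch update body could look like: the entity type `Sensor`, the attribute name `temperature`, the URN scheme and the broker address below are illustrative assumptions, and the HTTP request itself is deliberately left out. The NGSI-LD spec remains the reference for the exact options that control overwrite behaviour.

```python
import json


def build_batch_update(readings, entity_type="Sensor", attribute="temperature"):
    """Build the JSON array body for an NGSI-LD batch update from {sensor_id: value}."""
    return [
        {
            "id": f"urn:ngsi-ld:{entity_type}:{sensor_id}",
            "type": entity_type,
            attribute: {"type": "Property", "value": value},
        }
        for sensor_id, value in sorted(readings.items())
    ]


# Hypothetical broker address; adjust host and port to your deployment.
BATCH_UPDATE_URL = "http://localhost:1026/ngsi-ld/v1/entityOperations/update"

body = build_batch_update({"001": 21.5, "002": 19.8})
print(json.dumps(body, indent=2))
# The body would be sent in one POST to BATCH_UPDATE_URL with
# "Content-Type: application/json"; sending it is not part of this sketch.
```

One request then carries one attribute value per entity, which is the closest equivalent to a "batch attribute update".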

Related

Replacing items in message queue

Our system requirements say that we need to build a slightly unusual producer-consumer processing system. Imagine we have multiple data streams and we take a snapshot each X seconds and put it into the queue for processing. The streams count is not constant. The more clients we have, the more streams we need to process. At the same time, we don't need to process ALL taken snapshots. If we have too many clients and we are not able to process all items in real-time, we would prefer to skip old snapshots and process only the latest ones.
So as I see, the requirements can be met by keeping only one item in a queue for each stream. If there is a new snapshot, while the previous is still there, we need to REPLACE it using stream id as a key.
Is it possible to implement such behavior by Service Bus queue or something similar? Or maybe it makes sense to look into some other solutions like Redis?
To the best of my knowledge, Azure Service Bus does not support this scenario. Through its duplicate detection functionality, it in fact supports the exact opposite. You would need to use some other mechanism (like the Redis cache you mentioned) to accomplish this.
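The "replace by stream id" behaviour itself is easy to sketch in memory. The class below is a hypothetical illustration, not a Service Bus or Redis API. With Redis, a similar effect might be built from a hash keyed by stream id (writing a key again overwrites it) plus a structure tracking pending ids, but that mapping is an assumption rather than something spelled out above.

```python
from collections import OrderedDict


class LatestOnlyQueue:
    """FIFO of stream ids that holds at most one (the newest) snapshot per stream."""

    def __init__(self):
        self._items = OrderedDict()  # stream_id -> latest snapshot, in arrival order

    def put(self, stream_id, snapshot):
        # Replacing an existing key keeps the stream's original position in the queue.
        self._items[stream_id] = snapshot

    def get(self):
        """Pop the oldest waiting stream together with its most recent snapshot."""
        return self._items.popitem(last=False)

    def __len__(self):
        return len(self._items)


queue = LatestOnlyQueue()
queue.put("stream-a", "a1")
queue.put("stream-b", "b1")
queue.put("stream-a", "a2")  # overwrites a1; the consumer never sees it
first = queue.get()
second = queue.get()
print(first, second)
```

Because a replaced snapshot keeps its place in line, a busy stream cannot starve the others by constantly refreshing its entry.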

How to store json-patch operations in redis queue and guarantee their consistency?

I have a collaborative web application that handles JSON objects like the following:
var post = {
  id: 123,
  title: 'Sterling Archer',
  comments: [
    {text: 'Comment text', tags: ['tag1', 'tag2', 'tag3']},
    {text: 'Comment test', tags: ['tag2', 'tag5']}
  ]
};
My approach uses the RFC 6902 (JSON Patch) specification, with the jsonpatch library for patching the JSON document. All such documents are stored in a MongoDB database and, as you know, MongoDB is very slow for frequent writes.
To get more speed and support high load, I use Redis as a queue for patch operations like the following:
{ "op": "add", "path": "/comments/2", "value": {"text": "Comment test3", "tags": ["tag4"]} }
I just store all such patch operations in the queue, and at midnight a cron script fetches all the patches, constructs the full document, and updates it in the MongoDB database.
I don't yet understand what I should do in the case of a corrupted patch like:
{ "op": "add", "path": "/comments/0/tags/5", "value": "tag4" }
The patch above does not get applied to the document above, because the tags array has a length of only 3 (according to the official docs, https://www.rfc-editor.org/rfc/rfc6902#page-5):
The specified index MUST NOT be greater than the number of elements in the array.
So while the user is online he doesn't get any errors, because his patch operations are simply stored in the Redis queue; but the next day he gets a broken document, because the broken patch could not be applied by the cron script.
So my question is: how can I guarantee that all patches stored in the Redis queue are correct and don't corrupt the primary document?
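The array-index rule quoted from RFC 6902 can be checked before a patch is accepted into the queue. The following is a simplified sketch, not the jsonpatch library's API: the helper names are hypothetical, only the "add" operation is covered, and it assumes the check runs against an up-to-date copy of the document.

```python
def resolve(document, tokens):
    """Walk a parsed JSON document along JSON Pointer reference tokens."""
    for token in tokens:
        document = document[int(token)] if isinstance(document, list) else document[token]
    return document


def can_apply_add(document, path):
    """Return True if an RFC 6902 "add" at `path` is valid for `document`."""
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
    if not tokens:
        return True  # an empty path replaces the whole document
    try:
        parent = resolve(document, tokens[:-1])
    except (KeyError, IndexError, ValueError, TypeError):
        return False  # the parent location does not exist
    last = tokens[-1]
    if isinstance(parent, list):
        if last == "-":
            return True  # "-" appends to the end of the array
        # The index must not be greater than the number of elements.
        return last.isdigit() and int(last) <= len(parent)
    return isinstance(parent, dict)


post = {
    "id": 123,
    "title": "Sterling Archer",
    "comments": [
        {"text": "Comment text", "tags": ["tag1", "tag2", "tag3"]},
        {"text": "Comment test", "tags": ["tag2", "tag5"]},
    ],
}
print(can_apply_add(post, "/comments/2"))         # valid: index equals the array length
print(can_apply_add(post, "/comments/0/tags/5"))  # invalid: index 5 exceeds length 3
```

Rejecting the second patch at enqueue time lets the user see the error immediately instead of the next day.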
As with any system that can become inconsistent, you must allow for patches to be applied as quickly as possible if you wish to catch conflicts sooner and decrease the likelihood of running into them. That is likely your main issue if you are not notifying the other clients of any updated data as soon as possible (and are just waiting for the CRON to run to update the shared data that the other clients can access).
As others have asked, it's important to understand how a "bad" patch got into the operation queue in the first place. Here are some guesses from my standpoint:
A user had applied some operations that got lost in translation. How? I don't know, but it would explain the discrepancy.
Operations are not being applied in the correct order. How? I don't know. I have no code to go off of.
Although I have no code to go off of, I can take a shot in the dark and help you analyze the latter point. The first thing we need to analyze is the different scenarios that may come up with updating a "shared" resource. It's important to note that, in any system that must eventually be consistent, we care about the:
Order of the operations.
How we will deal with conflicts.
The latter is really up to you, and you will need a good notification/messaging system to update the "truth" that clients see.
Scenario 1
User A applies operations 1 & 2. The document is updated on the server and then User B is notified of this. User B was going to apply operations 3 & 4, but these operations (in this order) do not conflict with operations 1 & 2. All is well in the world. This is a good situation.
Scenario 2
User A applies operations 1 & 2. User B applies operations 3 & 4.
If you apply the operations atomically per user, you can get the following queues:
[1,2,3,4] [3,4,1,2]
Anywhere along the line, if there is a conflict, you must notify either User A or User B based on "who got there first" (or any other weighting semantics you wish to use). Again, how you deal with conflicts is up to you. If you have not read up on vector clocks, you should do so.
If you don't apply operations atomically per user, you can get the following queues:
[1,2,3,4] [3,4,1,2] [1,3,2,4] [3,1,4,2] [3,1,2,4] [1,3,4,2]
As you can see, forgoing atomic updates per user increases the combinations of updates and will therefore increase the likelihood of a collision happening. I urge you to ensure that operations are being added to the queue atomically per user.
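The two sets of queues above can be reproduced by enumerating the possible orderings. This is a small sketch with a hypothetical helper name; it simply merges the two users' operation lists while keeping each user's own order.

```python
def interleavings(a, b):
    """All merges of sequences a and b that preserve the order inside each one."""
    if not a:
        return [list(b)]
    if not b:
        return [list(a)]
    return ([[a[0]] + rest for rest in interleavings(a[1:], b)]
            + [[b[0]] + rest for rest in interleavings(a, b[1:])])


user_a, user_b = [1, 2], [3, 4]
atomic = [user_a + user_b, user_b + user_a]   # each user's operations stay together
non_atomic = interleavings(user_a, user_b)    # operations may interleave freely
print(len(atomic), len(non_atomic))
```

The gap widens quickly: with three operations per user there are still only two atomic orderings, but twenty interleaved ones.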
A Recap
Some important things you should remember:
1. Make sure updates to the queue are atomically applied per user.
2. Figure out how you will deal with several versions of a shared resource arising from multiple mutations from different clients (again I suggest you read up on vector clocks).
3. Don't update a shared resource that may be accessed by several clients in real-time as a cron job.
4. When there is a conflict that cannot be resolved, figure out how you will deal with it.
As a result of point 3, you will need to come up with a notification system so that clients can get updated resources quickly. As a result of point 4, you may choose to include telling clients that something went wrong with their update. Something that has just come to the top of my head is that you're already using Redis, which has pub/sub capabilities.
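Since vector clocks are recommended above for detecting diverging versions, here is a minimal comparison sketch. Representing a clock as a dict of per-client counters, and the function name, are assumptions made for illustration.

```python
def compare(clock_a, clock_b):
    """Compare two vector clocks given as {client_id: counter} dicts."""
    clients = set(clock_a) | set(clock_b)
    a_ahead = any(clock_a.get(c, 0) > clock_b.get(c, 0) for c in clients)
    b_ahead = any(clock_b.get(c, 0) > clock_a.get(c, 0) for c in clients)
    if a_ahead and b_ahead:
        return "concurrent"  # neither version has seen the other: a real conflict
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


print(compare({"A": 2, "B": 0}, {"A": 1, "B": 0}))  # A's version descends from the other
print(compare({"A": 2, "B": 0}, {"A": 1, "B": 1}))  # both sides changed independently
```

Only the "concurrent" result needs a conflict policy; the other outcomes can be applied or discarded safely.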
EDIT:
It seems like Google Docs handles conflict resolutions with transformations. That is, by shifting whole characters/lines over to make way for a hybrid application of all operations: https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs_22.html
As I had said before, it's all up to how you want to handle your own conflicts, which should largely be determined by the application/product itself and its use cases.
IMHO you are introducing unneeded complexity instead of using a simpler solution. These would be my alternative suggestions to your approach of a JSON Patch cron job, which is very hard to make consistent and atomic.
1. Use MongoDB only: with proper database design and indexing, and proper hardware allocation/sharding, the write performance of MongoDB is really fast. And the kinds of operations you are using in jsonpatch are natively supported for MongoDB BSON documents by its query language, e.g. $push, $set, $inc, $pull, etc.
Perhaps you want to avoid interrupting users' activities with a synchronous write to MongoDB; for that, the solution is to use async queues, as mentioned in point 2.
2. Use task queues and MongoDB: instead of storing patches in Redis like you do now, you can push the patching task to a task queue, which will asynchronously perform the MongoDB update, so the user will not experience any slowdown. One very good task queue is Celery, which can use Redis as a broker and messaging backend. So each user's updates get a single task and are applied to MongoDB by the task queue, and there will be no performance hit.
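To illustrate the claim that JSON Patch operations map onto native MongoDB update operators, here is a hedged sketch of a translator for two operations. The function names are hypothetical, escaped pointer tokens and keys containing dots are not handled, and MongoDB's `$position` appends when the index is past the end, which is more lenient than RFC 6902.

```python
def pointer_to_field(path):
    """Convert a JSON Pointer such as "/comments/0/tags" to "comments.0.tags"."""
    return ".".join(path.lstrip("/").split("/"))


def patch_to_update(operation):
    """Map a single JSON Patch operation onto a MongoDB update document."""
    op, path = operation["op"], operation["path"]
    if op == "replace":
        return {"$set": {pointer_to_field(path): operation["value"]}}
    if op == "add":
        parent, _, last = path.rpartition("/")
        if last == "-":
            return {"$push": {pointer_to_field(parent): operation["value"]}}
        if last.isdigit():
            # Assumes the parent is an array; $position inserts at that index.
            return {"$push": {pointer_to_field(parent): {
                "$each": [operation["value"]], "$position": int(last)}}}
        return {"$set": {pointer_to_field(path): operation["value"]}}
    raise ValueError(f"unsupported op: {op}")


append = patch_to_update({"op": "add", "path": "/comments/-", "value": {"text": "New"}})
insert = patch_to_update({"op": "add", "path": "/comments/0/tags/1", "value": "tag9"})
rename = patch_to_update({"op": "replace", "path": "/title", "value": "Lana Kane"})
print(append, insert, rename, sep="\n")
```

Each resulting dict could be passed as the update argument of a single-document update, so every user action becomes one atomic write instead of a nightly rebuild.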

Realtime synchronization of live data over network

How do you sync data between two processes (say client and server) in real time over network?
I have various documents/datasets constructed on the server, which are downloaded and displayed by clients. Once downloaded, the document receives continuous updates in order to remain fresh.
It seems to be a simple and commonly occurring concept, but I cannot find any tools that provide this level of abstraction. I am not even sure what I am looking for. Perhaps there is a similar concept with solid tool support? Perhaps there is a chain of different tools that must be put together? Here's what I have considered so far:
I am required to propagate every change in a single hop (0.5 RTT), which rules out polling (typically >10 RTT) and cache invalidation techniques (1.5 RTT).
Data replication and simple notification broadcasts are not an option, because there is too much data and too many changes. Clients must be able to select specific documents to download and monitor for changes.
I am currently using message passing pattern, which does the job, but it is hopelessly unproductive. It works at way too low level of abstraction. It is laborious, error-prone, and it doesn't scale well with increasing application complexity.
HTTP and other RPC-like techniques are good for the initial fetch, but they encourage polling for subsequent synchronization. When performing reverse requests (from data source to data consumer), change notifications are possible, but it's even more complicated than message passing.
Combining RPC (for the initial fetch) with message passing (for updates) turned out to be a nightmare due to the complexity involved in coordinating communication over the two parallel connections as well as due to the impedance mismatch between the two paradigms. I need something unified.
WebSocket & Comet are popular methods to implement change notification, but they need additional libraries to be productive and I am not aware of any libraries suitable for my application.
Message queues merely put an intermediary on the network while maintaining the basic message passing pattern. Custom message filters/routers allow me to get closer to the live document concept, but I feel like I am implementing custom middleware layer on top of the MQ.
I have tons of additional requirements (native observable data structure API on both ends, incremental updates, custom message filters, custom connection routing, cross-platform, robustness & scalability), but before considering those requirements, I need to find some tools that at least attempt to do what I need. I am trying to avoid in-house frameworks for the standard reasons - cost, time to market, long-term maintenance, and keeping developers happy.
My conclusion at the moment is that there is no such live document synchronization framework. In-house solution is the way to go, but many existing components can be used as part of the solution.
It is pretty simple to layer live document logic on top of WebSocket or any other message passing platform. Server just sends the document as a separate message when the connection is initiated and then after every change. Automated reconnection and some connection monitoring must be added to handle network failures.
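The "send the document on connect, then after every change" logic can be sketched independently of the transport. In this illustration the `send` callable stands in for a WebSocket connection, the class name is hypothetical, and reconnection and monitoring are omitted.

```python
import copy


class LiveDocument:
    """Holds a document and pushes it to every subscriber on connect and on change."""

    def __init__(self, initial):
        self._data = dict(initial)
        self._subscribers = []

    def subscribe(self, send):
        """Register a client; `send` stands in for writing to its connection."""
        self._subscribers.append(send)
        send(copy.deepcopy(self._data))      # initial snapshot on connect

    def update(self, **changes):
        self._data.update(changes)
        for send in self._subscribers:
            send(copy.deepcopy(self._data))  # full document after every change


received = []
doc = LiveDocument({"price": 10})
doc.subscribe(received.append)
doc.update(price=11)
doc.update(price=12, volume=3)
print(received)
```

Sending whole snapshots keeps the client trivial; incremental updates would replace the deep copies with diffs at the cost of more protocol logic.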
Serialization at both ends is a separate problem targeted by many existing libraries. Detecting changes in server-side data structures (needed to initiate push) is yet another separate problem that has its own set of patterns and tools. Incremental updates and many other issues can be solved by intermediaries intercepting the connection.
This approach will work with current technology at the cost of extensive in-house glue code. It can be incrementally substituted with standard components as they become available.
WebSocket already includes resource URIs, routing, and a few other nice features. Useful intermediaries and libraries will likely emerge in the future. HTTP with text/event-stream MIME type is a possible future alternative to WebSocket. The advantage of HTTP is that existing tools can be reused with little modification.
I've completely thrown away the pattern of combining RPC pull with separate push channel despite rich tool support. Pushing everything in 0.5 RTT requires the push channel to use exactly the same technology as the pull channel, i.e. reverse RPC. Reverse RPC is like message passing except it introduces redundant returns, throws away useful connection semantics, and makes it hard to insert content-agnostic intermediaries into the stream.

Message queuing solution for millions of topics

I'm thinking about system that will notify multiple consumers about events happening to a population of objects. Every subscriber should be able to subscribe to events happening to zero or more of the objects, multiple subscribers should be able to receive information about events happening to a single object.
I think that some message queuing system will be appropriate in this case, but I'm not sure how to handle the fact that I'll have millions of objects: using a separate topic for every object does not sound good [or is it just fine?].
Can you please suggest an approach I should take, and maybe even an open source message queuing system that would be reasonable?
Few more details:
there will be thousands of subscribers [meaning not plenty of them],
subscribers will subscribe to tens or hundreds of objects each,
there will be ~5-20 million of the objects,
events themselves don't have to carry any message; just the information that the object was changed is enough,
vast majority of objects will never be subscribed to,
events occur at the maximum rate of few hundreds per second,
ideally the server should run under linux, be able to integrate with the rest of the ecosystem via http long-poll [using node js? continuations under jetty?].
Thanks in advance for your feedback and sorry for somewhat vague question!
I can highly recommend RabbitMQ. I have used it in a couple of projects before, and from my experience it is very reliable and offers a wide range of configurations. Basically, RabbitMQ is an open-source (Mozilla Public License, MPL) message broker that implements the Advanced Message Queuing Protocol (AMQP) standard.
As documented on the RabbitMQ web-site:
RabbitMQ can potentially run on any platform that Erlang supports, from embedded systems to multi-core clusters and cloud-based servers.
... meaning that an operating system like Linux is supported.
There is a library for node.js here: https://github.com/squaremo/rabbit.js
It comes with an HTTP based API for management and monitoring of the RabbitMQ server - including a command-line tool and a browser-based user-interface as well - see: http://www.rabbitmq.com/management.html.
In the projects I have worked on, I have communicated with RabbitMQ using C# and two different wrappers, EasyNetQ and Burrow.NET. Both are excellent wrappers for RabbitMQ, but I ended up preferring Burrow.NET, as it is easier and more obvious to work with (it doesn't do a lot of magic under the hood) and provides good flexibility to inject loggers, serializers, etc.
I have never worked with the number of objects that you are going to work with: I have worked with thousands, not millions. However, no matter how many objects I have been playing around with, RabbitMQ has always been really stable and has never been the source of errors in the system.
So to sum up: RabbitMQ is simple to use and set up, supports AMQP, can be managed via HTTP and, what I like the most, it's rock solid.
Break up the topics so that each carries a specific kind of event, e.g. an "Object updated" topic, an "Object deleted" topic, and so on. That way clients only have to subscribe to the finite number of event-based topics they are interested in.
Inject headers into your messages when you publish them, and put intelligence into the clients to use these headers as message selectors. For example, the client knows the list of objects it is interested in; if you identify each object by an "id", the id can be the header, and the client will use the "id" header to determine whether it is interested in the message.
Depending on your needs, you may also want to consider guaranteed delivery, to make sure that a client will receive the message even if it goes offline and comes back later.
The options I would recommend off the top of my head are ActiveMQ, RabbitMQ and Redis pub/sub (I haven't really worked with Redis pub/sub, so please do your due diligence).
Finally here are some performance benchmarks for RabbitMQ and Redis
I just saw that you only have a few hundred messages pushed out per second; this is not a big deal for ActiveMQ. I have been using ActiveMQ on a system that processes 240 messages per second, and it works just fine. I use a thread pool of workers to process the messages asynchronously, though. Look at a framework like Akka if you are in Java land; if not, stick with Node.js and the cool ecosystem around it.
If it has to be open source, I'd go for ActiveMQ together with an application server to provide the JMS functionality for topics; it has Ajax support, so you can access them from your client.
So you would use the JMS infrastructure to publish the topics for the objects, and you can create topics as you need them.
Besides, by using a Java application server you may be able to take advantage of clustering, load balancing and other high-availability features (depending on the selected product, obviously).
Hope that helps!!!
Since your messages are very small, you might want to consider MQTT, which is designed for small devices, although it works fine on powerful devices as well. The key consideration is the low overhead: basically a 2-byte header for a small message. You probably can't use any simple or open source MQTT server, due to your volume. You probably need a heavy-duty dedicated appliance like a MessageSight to handle your volume.
Some more details on your application would certainly help. Also you don't mention security at all. I assume you must have some needs in this area.
I'm not sure about your work environment, but here are my two cents. Can you identify each object with a unique ID in your system? If so, you can have one topic per event type, e.g. one for object deletion events, one for object update events, and so on. Each topic would be published to with the ID of the object whenever the corresponding event happens to that object. This limits the number of topics you need.
The second part of your problem is that different subscribers want to subscribe to different objects, so not all subscribers are interested in the events of all objects. This part comes down to the message selector (filtering) mechanism provided by the messaging framework. So basically you need to work out on what basis a subscriber is interested in a particular object, and use that basis as the message filtering criterion. It could be anything: object type, object state, etc. Ultimately your system would consist of one topic per event type, with a publisher sending event messages carrying {object-type:object-id} information. Subscribers could subscribe to any topic with a filtering criterion.
If the above solution is satisfactory, you can use any messaging solution: ActiveMQ, WMQ, RabbitMQ.
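The "one topic per event type plus an id filter per subscriber" design suggested in several answers can be sketched in memory as follows. The `EventBroker` class is a hypothetical stand-in for a real broker's topics and message selectors, not the API of ActiveMQ, WMQ or RabbitMQ.

```python
from collections import defaultdict


class EventBroker:
    """One topic per event type; each subscription filters on object ids."""

    def __init__(self):
        self._topics = defaultdict(list)  # event type -> [(object ids, callback)]

    def subscribe(self, event_type, object_ids, callback):
        self._topics[event_type].append((frozenset(object_ids), callback))

    def publish(self, event_type, object_id):
        """Deliver the object id to subscribers whose filter contains it."""
        delivered = 0
        for object_ids, callback in self._topics[event_type]:
            if object_id in object_ids:
                callback(event_type, object_id)
                delivered += 1
        return delivered


broker = EventBroker()
inbox_1, inbox_2 = [], []
broker.subscribe("updated", {101, 102}, lambda event, obj: inbox_1.append((event, obj)))
broker.subscribe("updated", {102}, lambda event, obj: inbox_2.append((event, obj)))
hits = broker.publish("updated", 102)    # both subscribers watch object 102
misses = broker.publish("updated", 999)  # nobody subscribed to this object
print(hits, misses, inbox_1, inbox_2)
```

The number of topics stays equal to the number of event types, regardless of how many millions of objects exist.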

Simple scalable work/message queue with delay

I need to set up a job/message queue with the option to set a delay for the task so that it's not picked up immediately by a free worker, but after a certain time (can vary from task to task). I looked into a couple of linux queue solutions (rabbitmq, gearman, memcacheq), but none of them seem to offer this feature out of the box.
Any ideas on how I could achieve this?
Thanks!
I've used BeanstalkD to great effect, using the delay option on inserting a new job to wait several seconds till the item becomes available to be reserved.
If you are doing longer-term delays (more than say 30 seconds), or the jobs are somewhat important to perform (albeit later), then it also has a binary logging system so that any daemon crash would still have a record of the job. That said, I've put hundreds of thousands of live jobs through Beanstalkd instances and the workers that I wrote were always more problematic than the server.
You could use an AMQP broker (such as RabbitMQ) and have an "agent" (e.g. a Python process built using python-amqplib) that sits on an exchange and intercepts specific messages (by routing_key); once a timer has elapsed, it sends the message back to the exchange with a different routing_key.
I realize this means "translating/mapping" routing keys but it works. Working with RabbitMQ and python-amqplib is very straightforward.
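The agent's core logic (hold a message until its timer elapses, then release it under a different routing key) can be sketched without a broker. The class name, the `work.ready` routing key and the explicit `now` arguments are assumptions made so the example is self-contained; a real agent would consume from and publish to the exchange.

```python
import heapq
import itertools


class DelayAgent:
    """Holds messages until their delay has elapsed, then releases them re-keyed."""

    def __init__(self, ready_key="work.ready"):
        self.ready_key = ready_key          # hypothetical routing key for released messages
        self._heap = []                     # (due time, sequence number, body)
        self._sequence = itertools.count()  # tie-breaker so bodies are never compared

    def hold(self, body, delay, now):
        heapq.heappush(self._heap, (now + delay, next(self._sequence), body))

    def release_due(self, now):
        """Return (routing_key, body) pairs for every message whose timer has elapsed."""
        released = []
        while self._heap and self._heap[0][0] <= now:
            _, _, body = heapq.heappop(self._heap)
            released.append((self.ready_key, body))
        return released


agent = DelayAgent()
agent.hold("send email", delay=30, now=0)
agent.hold("resize image", delay=5, now=0)
early = agent.release_due(now=10)
late = agent.release_due(now=30)
print(early, late)
```

Workers would bind only to the released routing key, so they never see a task before its delay has passed.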