Persisting real-time data through Cygnus to Cosmos is slow and unreliable - fiware

Cygnus version is 0.8.2 and I'm using the public instance of Cosmos from our FI-Ware instance inside the FI-Ware Lab.
I have 8 sensor devices that push updates to IDAS. Some updates come once per second, some once per 5 seconds, on average around 8.35 updates per second. I created subscriptions to Orion (version 0.22) to send ONCHANGE notifications to Cygnus.
Cygnus is configured to persist data to Cosmos, MongoDB and MySQL. I used the standard configuration with one source (http-source), three channels (hdfs-channel, mysql-channel, mongo-channel) and three sinks (hdfs-sink, mysql-sink, mongo-sink).
The mysql-sink and mongo-sink persist data in near real time. However, the hdfs-sink is really slow, handling only about 1.65 events per second. As the http-source receives around 8.35 events per second, the hdfs-channel soon fills up and a warning appears in the log file:
time=2015-07-30T13:39:02.168CEST | lvl=WARN | trans=1438256043-345-0000002417 | function=doPost | comp=Cygnus | msg=org.apache.flume.source.http.HTTPSource$FlumeHTTPServlet[203] : Error appending event to channel. Channel might be full. Consider increasing the channel capacity or make sure the sinks perform faster.
org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: hdfs-channel}
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:200)
at org.apache.flume.source.http.HTTPSource$FlumeHTTPServlet.doPost(HTTPSource.java:201)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:814)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:401)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.flume.ChannelException: Space for commit to queue couldn't be acquired Sinks are likely not keeping up with sources, or the buffer size is too tight
at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:128)
at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:192)
... 16 more
A side effect is that if the http-source cannot inject the notification into the hdfs-channel, it doesn't inject it into the mysql-channel or mongo-channel either, so that notification is completely lost. It is not persisted anywhere.
You can partly work around the problem by launching 3 separate Cygnus instances (one for Cosmos, one for MySQL and one for MongoDB), each with a different http-source port and a different Management Interface port, and adding subscriptions for each Cygnus. MySQL and MongoDB persistence is then not affected by the hdfs-channel becoming full, but Cosmos persistence still has the problem. Adding more hdfs-sinks might do the trick with our 8 sensor devices, but if you add more sensor devices or they send more updates, you are just postponing the problem.
These 2 questions are a bit unrelated, but I'm asking anyway...
Question 1: Is it really the case that persisting to Cosmos is that slow?
I know there is a lot going on behind the scenes compared to persisting to the local databases, and that we are using the public, resource-limited instance of Cosmos, but still: is it even meant to be used this way with real-time data (our 8-sensor-device test is quite modest)? Of course it is possible to create a sink that pushes the data to a file and then do a simple file upload to Cosmos, but that is more hassle. I guess there isn't such a file sink available?
Question 2: Is it really the case that if a notification cannot be injected into the hdfs-channel (I guess into any channel), it is not added to the other channels either and is discarded completely?

The design of all the sinks is quite similar; nevertheless, there are some differences between the HDFS sink and the MySQL/MongoDB sinks:
The HDFS endpoint (the HttpFS server running at cosmos.lab.fiware.org:14000) is shared among many FIWARE users, whereas I guess your MySQL and MongoDB deployments are private ones and thus only used by you.
The HDFS sink is based on WebHDFS, a REST API, while the MySQL and MongoDB sinks are based on "binary protocols" (the JDBC and Mongo drivers are used, respectively). There is an old issue on GitHub about moving to a "binary" implementation of the sink.
That said, and trying to fix the problem with the current implementation, these are my recommendations:
Try changing the logging level to ERROR; writing log traces consumes a lot of resources.
Try to send "batches" of notifications to Cygnus (an Orion notification may contain several context entity elements); each batch is stored as a single Flume event in the channel.
As you already figured out, try configuring more than one HDFS sink; this is explained here (reading the full document is also a good idea). A configuration sketch follows below.
Nevertheless, if the bottleneck is the HDFS endpoint itself, I suspect this won't fix anything.
Regarding Cygnus not putting an event into the other, non-HDFS channels when it cannot be put into the HDFS channel, I'll have a look at that. Cygnus relies on Apache Flume and event delivery is part of the Flume core, so it seems to be a bug/problem in Flume.
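As a rough sketch of that multi-sink recommendation (the agent name "cygnusagent" follows the Cygnus templates but check your own agent file; keep your existing HDFS sink settings, duplicated for each new sink, since sink class names vary between Cygnus versions):

# Several HDFS sinks draining the same channel in parallel
cygnusagent.sinks = hdfs-sink1 hdfs-sink2 hdfs-sink3 mysql-sink mongo-sink
cygnusagent.channels = hdfs-channel mysql-channel mongo-channel
cygnusagent.sinks.hdfs-sink1.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink2.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink3.channel = hdfs-channel

# Give the memory channel more headroom while the sinks catch up
cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 100000
cygnusagent.channels.hdfs-channel.transactionCapacity = 100

Each Flume sink runs in its own thread, so several sinks attached to the same channel drain it concurrently; the larger capacity only buys time if the sinks can eventually keep up.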

Related

How to deploy an eth2 rpc node to obtain transactions?

I want to deploy an ETH RPC node and only need to obtain transaction data related to me. Do I only need to use Geth? Do I need to use Prysm?
I currently get the transaction data this way:
https://github.com/Adamant-im/ETH-transactions-storage
I found this method in How to get Ethereum transaction list by address.
I see some introduction to synchronization modes here: https://ethereum.org/en/developers/docs/nodes-and-clients/#sync-modes
There are several modes, as follows:
Full sync / Fast sync / Light sync / Snap sync / Optimistic sync / Checkpoint sync
After reading the descriptions, I still don't know which one is the most suitable for only obtaining transaction data.
I'm a novice, so please give me an example of how to start the functionality I need.

Kafka producer vs Kafka connect to read MySQL Datasource

I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also connect Kafka to the MySQL datasource using Kafka Connect or Debezium. My target is to ingest the data using Kafka and send it to Storm to consume and analyse. It looks like both ways can achieve my target, but using a Kafka producer may require me to build a service that keeps reading the datasource.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not reinventing the wheel and using Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, it can do initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might do the same in your producer, it's not clear from the question). This provides many advantages over polling (a sketch of the polling approach, for contrast, follows the list below):
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)
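For illustration only, here is a rough Python sketch of the polling approach the list above argues against (the kafka-python and mysql-connector-python packages, and a clicks table with an auto-increment id column, are assumptions, not details from the question):

# Hypothetical poll-based producer, shown only to contrast with log-based CDC.
import json
import time

import mysql.connector
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)
db = mysql.connector.connect(host="localhost", user="app", password="secret",
                             database="website", autocommit=True)

last_id = 0
while True:
    cursor = db.cursor(dictionary=True)
    cursor.execute("SELECT * FROM clicks WHERE id > %s ORDER BY id", (last_id,))
    for row in cursor:
        producer.send("website-clicks", row)  # one Kafka event per new row
        last_id = row["id"]
    cursor.close()
    producer.flush()
    time.sleep(1)  # the poll interval trades latency against DB load

A sketch like this misses updates and deletes that happen between polls and carries the latency/load trade-off described above, which is exactly what log-based capture avoids.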

How to recover from NoReplicaOnlineException with one Kafka broker?

We have a really simple Kafka 0.8.1.1 setup in our development lab. It's just one node. Periodically, we run into this error:
[2015-08-10 13:45:52,405] ERROR Controller 0 epoch 488 initiated state change for partition [test-data,1] from OfflinePartition to OnlinePartition failed (state.change.logger)
kafka.common.NoReplicaOnlineException: No replica for partition [test-data,1] is alive. Live brokers are: [Set()], Assigned replicas are: [List(0)]
at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
Can anyone recommend a strategy for recovering from this? Is there such a thing or do we need to build out another node or two and set up the replication factor on our topics to cover all of the nodes that we put into the cluster?
We have 3 zookeeper nodes that respond very well for other applications like Storm and HBase, so we're pretty confident that ZooKeeper isn't to blame here. Any ideas?
This question is about Kafka 0.8, which should be out of support, if I'm not mistaken. However, for future readers the following guidelines should be relevant:
If you care about stability, uptime, reliability or anything in this general direction, make sure you have at least 3 Kafka nodes and a replication factor on your topics that covers them (an example topic-creation command follows these points).
If you have a problem in an old Kafka version, seriously consider upgrading to the latest Kafka version. At the time of writing we are already at Kafka 2.
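For example, on a recent Kafka release, creating a topic whose replicas span three brokers looks roughly like this (the --bootstrap-server flag replaces the old --zookeeper flag; the topic name, partition count and host are illustrative):

bin/kafka-topics.sh --create --topic test-data \
  --partitions 2 --replication-factor 3 \
  --bootstrap-server localhost:9092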

Is Zookeeper a must for Kafka? [closed]

In Kafka, I would like to use only a single broker, a single topic and a single partition, with one producer and multiple consumers (each consumer getting its own copy of the data from the broker). Given this, I do not want the overhead of using ZooKeeper; can I not just use the broker only? Why is ZooKeeper a must?
Yes, Zookeeper is required for running Kafka. From the Kafka Getting Started documentation:
Step 2: Start the server
Kafka uses zookeeper so you need to first start a zookeeper server if you don't already have one. You can use the convenience script packaged with kafka to get a quick-and-dirty single-node zookeeper instance.
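In practice that quick-and-dirty setup is just the two commands from the Kafka quickstart, run from the unpacked Kafka directory:

# Start a single-node ZooKeeper using the bundled convenience script
bin/zookeeper-server-start.sh config/zookeeper.properties

# Then, in another terminal, start the Kafka broker
bin/kafka-server-start.sh config/server.properties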
As to why: well, people long ago discovered that you need some way of coordinating tasks, state management, configuration, etc. across a distributed system. Some projects have built their own mechanisms (think of the configuration server in a MongoDB sharded cluster, or a Master node in an Elasticsearch cluster). Others have chosen to take advantage of ZooKeeper as a general-purpose distributed process coordination system. So Kafka, Storm, HBase and SolrCloud, to name just a few, all use ZooKeeper to help manage and coordinate.
Kafka is a distributed system and is built to use Zookeeper. The fact that you are not using any of the distributed features of Kafka does not change how it was built. In any event there should not be much overhead from using Zookeeper. A bigger question is why you would use this particular design pattern -- a single broker implementation of Kafka misses out on all of the reliability features of a multi-broker cluster along with its ability to scale.
As explained by others, Kafka (even in most recent version) will not work without Zookeeper.
Kafka uses Zookeeper for the following:
Electing a controller. The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions. When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away. ZooKeeper is used to elect a controller, make sure there is only one, and elect a new one if it crashes.
Cluster membership - which brokers are alive and part of the cluster? This is also managed through ZooKeeper.
Topic configuration - which topics exist, how many partitions each has, where are the replicas, who is the preferred leader, what configuration overrides are set for each topic
(0.9.0) - Quotas - how much data is each client allowed to read and write
(0.9.0) - ACLs - who is allowed to read and write to which topic
(old high level consumer) - Which consumer groups exist, who are their members and what is the latest offset each group got from each partition.
[from https://www.quora.com/What-is-the-actual-role-of-ZooKeeper-in-Kafka/answer/Gwen-Shapira]
Regarding your scenario of only one broker instance and one producer with multiple consumers: you can use Pusher to create a channel and push events to that channel, which the consumers can subscribe to and handle.
https://pusher.com/
Important update - August 2019:
ZooKeeper dependency will be removed from Apache Kafka. See the high-level discussion in KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum.
These efforts will take a few Kafka releases and additional KIPs. Kafka Controllers will take over the tasks currently handled by ZooKeeper. The Controllers will leverage the benefits of the Event Log, which is a core concept of Kafka.
Some benefits of the new Kafka architecture are a simpler architecture, ease of operations, and better scalability e.g. allow "unlimited partitions".
Update - October 2022:
For new clusters, as of the 3.3 release you can use Apache Kafka without ZooKeeper (in the new mode, called KRaft mode) in production.
Apache Kafka Raft (KRaft) is the consensus protocol that was introduced to remove Apache Kafka's dependency on ZooKeeper for metadata management. The development progress is tracked in KIP-500.
KRaft mode was released as early access in Kafka 2.8. It was not suitable for production before version 3.3 (see details in KIP-833: Mark KRaft as Production Ready).
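As a rough sketch, bringing up a single ZooKeeper-less node on Kafka 3.3+ follows the KRaft quickstart (the server.properties under config/kraft ships with the distribution; adjust paths to your install):

# Generate a cluster ID and format the storage directories for KRaft mode
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties

# Start the combined broker/controller; no ZooKeeper process is involved
bin/kafka-server-start.sh config/kraft/server.properties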
1. Benefits of Kafka’s new quorum controller
Enables Kafka clusters to scale to millions of partitions through improved control plane performance with the new metadata management
Improves stability, simplifies the software, and makes it easier to monitor, administer, and support Kafka.
Allows Kafka to have a single security model for the whole system
Provides a lightweight, single process way to get started with Kafka
Makes controller failover near-instantaneous
2. Timeline
Note: this timeline is very rough and subject to change.
2022/10: KRaft mode declared production-ready in Kafka 3.3
2023/02: Upgrade from ZK mode supported in Kafka 3.4 as early access.
2023/04: Kafka 3.5 released with both KRaft and ZK support. Upgrade from ZK goes production. ZooKeeper mode deprecated.
2023/10: Kafka 4.0 released with only KRaft mode supported.
References:
KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
Apache Kafka Needs No Keeper: Removing the Apache ZooKeeper Dependency
Preparing Your Clients and Tools for KIP-500: ZooKeeper Removal from Apache Kafka
KRaft: Apache Kafka Without ZooKeeper
Kafka is built to use Zookeeper. There is no escaping from that.
Kafka is a distributed system and uses Zookeeper to track status of kafka cluster nodes. It also keeps track of Kafka topics, partitions etc.
Looking at your question, it seems you do not need Kafka. You can use any application that supports pub-sub, such as Redis or RabbitMQ, or hosted solutions such as PubNub.
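For example, a bare-bones pub/sub round trip with Redis in Python (the channel name and the redis-py package are assumptions for illustration):

# Minimal publish/subscribe with Redis instead of a single-broker Kafka setup.
# Requires a running Redis server and the redis-py package.
import redis

r = redis.Redis(host="localhost", port=6379)

# Each consumer opens its own subscription and receives its own copy of every message
sub = r.pubsub()
sub.subscribe("sensor-updates")

# Somewhere else, the single producer publishes to the channel
r.publish("sensor-updates", "temperature=21.5")

for message in sub.listen():
    if message["type"] == "message":  # skip the initial subscribe confirmation
        print(message["data"])        # b'temperature=21.5'
        break

Note that Redis pub/sub does not retain messages for consumers that are offline, which is one of the things Kafka's log gives you.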
IMHO Zookeeper is not an overhead but makes your life a lot easier.
It is basically used to maintain coordination between different nodes in a cluster. One of the most important things for Kafka is that it uses ZooKeeper to periodically commit offsets, so that in case of node failure it can resume from the previously committed offset (imagine taking care of all of this on your own).
Zookeeper also plays a vital role for serving many other purposes, such as leader detection, configuration management, synchronization, detecting when a new node joins or leaves the cluster, etc.
Future Kafka releases are planning to remove the zookeeper dependency but as of now it is an integral part of it.
Here are a few lines taken from their FAQ page:
Once the Zookeeper quorum is down, brokers could result in a bad state and could not normally serve client requests, etc. Although when Zookeeper quorum recovers, the Kafka brokers should be able to resume to normal state automatically, there are still a few corner cases that they cannot, and a hard kill-and-recovery is required to bring it back to normal. Hence it is recommended to closely monitor your zookeeper cluster and provision it so that it is performant.
For more details check here
ZooKeeper is a centralised coordination and management system for any kind of distributed system. A distributed system is a set of software modules running on different nodes/clusters (possibly in geographically distant locations) yet running as one system. ZooKeeper facilitates communication between the nodes, shares configuration among the nodes, keeps track of which node is the leader, which node joins/leaves, etc. ZooKeeper is what keeps distributed systems sane and maintains consistency. ZooKeeper is basically an orchestration platform.
Kafka is a distributed system. And hence it needs some kind of orchestration for its nodes that might be geographically distant (or not).
Apache Kafka v2.8.0 gives you early access to KIP-500, which removes the ZooKeeper dependency from Kafka, meaning it no longer requires Apache ZooKeeper.
Instead, Kafka can now run in Kafka Raft metadata mode (KRaft mode) which enables an internal Raft quorum. When Kafka runs in KRaft mode its metadata is no longer stored on ZooKeeper but on this internal quorum of controller nodes instead. This means that you don't even have to run ZooKeeper at all any longer.
Note however that v2.8.0 is currently early access and you should not use Zookeeper-less Kafka in production for the time being.
A few benefits of removing ZooKeeper dependency and replacing it with an internal quorum:
More efficient as controllers no longer need to communicate with ZooKeeper to fetch cluster state metadata every time the cluster is starting up or when a controller election is being made
More scalable as the new implementation will be able to support many more topics and partitions in KRaft mode
Easier cluster management and configuration as you don't have to manage two distinct services any longer
Single process Kafka Cluster
For more details you can read the article Kafka No Longer Requires ZooKeeper
Yes, ZooKeeper is a must by design for Kafka, because ZooKeeper has the responsibility of managing the Kafka cluster. It has the list of all Kafka brokers. It notifies Kafka if any broker or partition goes down, or if a new broker or partition comes up. In short, ZooKeeper keeps every Kafka broker updated about the current state of the Kafka cluster.
Then all a Kafka client (producer/consumer) needs to do is connect to any single broker; that broker has all the metadata kept up to date by ZooKeeper, so the client need not bother with the broker-discovery headache.
Other than the usual payload message transfer, there are many other communications that happen in Kafka, such as:
Events related to brokers requesting the cluster membership.
Events related to Brokers becoming available.
Getting bootstrap config setups.
Events related to controller and leader updates.
Health status updates, like heartbeat updates.
ZooKeeper itself is a distributed system consisting of multiple nodes in an ensemble. ZooKeeper is a centralised service for maintaining such metadata.
This article explains the role of ZooKeeper in Kafka. It explains how Kafka is stateless and how ZooKeeper plays an important role in the distributed nature of Kafka (and of many other distributed systems).
The request to run Kafka without Zookeeper seems to be quite common. The library Charlatan addresses this.
According to its description, Charlatan is more or less a mock for ZooKeeper, providing the ZooKeeper services backed either by other tools or by a database.
I encountered that library when dealing with the main product of the Charlatan library's authors; there it works fine …
Firstly
Apache ZooKeeper is a distributed store which is used to provide configuration and synchronization services in a highly available way.
In more recent versions of Kafka, work was done so that client consumers do not store information about how far they have consumed messages (called offsets) in ZooKeeper. This reduced usage did not get rid of the need for consensus and coordination in distributed systems, however. While Kafka provides fault tolerance and resilience, something is needed to provide the coordination required, and ZooKeeper enables that piece of the overall system.
Secondly
Agreeing on who the leader of a partition is, is one example of the practical application of ZooKeeper within the Kafka ecosystem.
ZooKeeper would still be needed even if there were only a single broker.
These are from Kafka In Action book.

A backup persistence store. What are my options?

I have a service that accepts callbacks from a provider.
Motivation: I do not want to EVER lose any callbacks (unless of course my network becomes unreachable).
Let's suppose the impossible happens and my MySQL server becomes unreachable for some time. I want to fall back to a secondary persistence store once I've retried several times and failed.
What are my options? Queues, in-memory cache ?
You say you're receiving "callbacks" - you've not made clear what they are. What is the protocol? Is it over a network?
If it were HTTP, then I would say the best way is this: if your application is unable to write the data into permanent storage, it should return an error ("try again later", if that exists in the protocol) to the caller, who should try again later.
An asynchronous process like a callback should always be able to cope with failures downstream and queue its requests.
I've worked with a payment provider where this has been the case (Paypal). If you're unable to completely process the request, just send an error back to the caller.
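A minimal sketch of that idea for an HTTP callback endpoint (Flask, the route and the write_to_mysql helper are illustrative assumptions, not from the original answer):

# Reject the callback with a retryable error when the primary store is down,
# so the provider re-delivers it later instead of the data being lost.
from flask import Flask, request

app = Flask(__name__)

def write_to_mysql(payload):
    """Hypothetical primary-store write; raises when MySQL is unreachable."""
    ...

@app.route("/callback", methods=["POST"])
def callback():
    try:
        write_to_mysql(request.get_json(force=True))
    except Exception:
        # 503 plus Retry-After tells a well-behaved caller to try again later
        return "try again later", 503, {"Retry-After": "60"}
    return "", 204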
I recommend some sort of job queue server. I personally use Starling and have had great results with it. It speaks the memcache protocol so it is easy to use as a persistent queue.
Starling on Github
I've put a queue in SQLite for this before. Though, in my case, it was to protect against loss of the network link to the MySQL server — the data was locally-generated.
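A minimal sketch of that local SQLite buffer in Python (the table name and helper functions are illustrative, not the original implementation):

# Park callbacks in a local SQLite file when the primary MySQL write fails,
# and replay them once the link to MySQL is back.
import json
import sqlite3

queue_db = sqlite3.connect("callback_queue.db")
queue_db.execute(
    "CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, payload TEXT)"
)

def store_or_queue(payload, write_to_mysql):
    """Try the primary store first; on failure, enqueue locally instead of dropping."""
    try:
        write_to_mysql(payload)
    except Exception:
        queue_db.execute("INSERT INTO pending (payload) VALUES (?)",
                         (json.dumps(payload),))
        queue_db.commit()

def replay_pending(write_to_mysql):
    """Drain the local queue once MySQL is reachable again."""
    rows = queue_db.execute("SELECT id, payload FROM pending ORDER BY id").fetchall()
    for row_id, payload in rows:
        write_to_mysql(json.loads(payload))  # raises again if MySQL is still down
        queue_db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
    queue_db.commit()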
You can have a backup MySQL server and switch your connection to it in case the primary one breaks down. If it's only going to be a failover store, you could probably run it locally on the application server.