Hi, I am pretty new to Apache Kafka, so I don't know how much sense this will make.
I did a lot of research and couldn't find what the advantage of multiple brokers is.
I went through the whole Kafka documentation and couldn't find an answer for this.
Say, for example, I am receiving data from two different sets of devices, which I should manipulate and store. Depending on which set of devices the data arrives from, the consumer will change.
Should I go with multi-broker / single topic / multi-partition, OR single broker / single topic / multi-partition, OR some other approach?
Any help or guide is appreciated.
As with pretty much any distributed system: scalability and resiliency. One broker goes down - no problem if you have replication set up. You suddenly get a traffic spike which would be too much for a single machine to handle - no problem if you have a cluster of machines to handle the traffic.
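For illustration, here's a minimal sketch of setting that up with the kafka-python package; the broker address, topic name, and partition/replica counts are placeholders, and it assumes a cluster of at least three brokers is already running:

```python
# Minimal sketch using kafka-python (pip install kafka-python).
# Assumes a cluster of at least three brokers reachable via localhost:9092;
# the topic name and counts below are illustrative only.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# With replication_factor=3, every partition has a copy on three brokers,
# so losing one broker costs neither data nor availability.
topic = NewTopic(name="device-data", num_partitions=6, replication_factor=3)
admin.create_topics([topic])
admin.close()
```

For your two device sets, a common pattern is a single topic keyed by device set, or two topics each with their own consumers; either works across however many brokers you end up with.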
We have a really simple Kafka 0.8.1.1 setup in our development lab; it's just one node. Periodically, we run into this error:
[2015-08-10 13:45:52,405] ERROR Controller 0 epoch 488 initiated state change for partition [test-data,1] from OfflinePartition to OnlinePartition failed (state.change.logger)
kafka.common.NoReplicaOnlineException: No replica for partition [test-data,1] is alive. Live brokers are: [Set()], Assigned replicas are: [List(0)]
at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
Can anyone recommend a strategy for recovering from this? Is there such a thing, or do we need to build out another node or two and set the replication factor on our topics to cover all of the nodes we put into the cluster?
We have 3 ZooKeeper nodes that respond very well for other applications like Storm and HBase, so we're pretty confident that ZooKeeper isn't to blame here. Any ideas?
This question is about Kafka 0.8, which should be out of support if I am not mistaken. However, for future readers the following guidelines should be relevant:
If you care about stability, uptime, reliability, or anything in this general direction, make sure you have at least 3 Kafka nodes.
If you have a problem on an old Kafka version, seriously consider upgrading to the latest Kafka version. At the time of writing we are already at Kafka 2.
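To make the reliability guideline concrete, here's a hedged sketch of durability-leaning producer settings with the kafka-python client; the address and topic name are placeholders:

```python
# Sketch of durability-leaning producer settings (kafka-python).
# acks="all" waits for all in-sync replicas to acknowledge each write,
# which only helps if your topics are actually replicated across nodes.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder address
    acks="all",   # wait for all in-sync replicas before confirming
    retries=5,    # ride out transient broker failures
)
producer.send("some-topic", b"payload")
producer.flush()  # block until the send is acknowledged
```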
I've created a nightly sync between two database applications for a small construction company and set up simple notifications using Database Mail to let a few people know whether the load was successful or not. Now that they see this notification is working, I've been asked to provide status updates to their clients as employees make changes to the work order throughout the day.
I've done some research and understand DB Mail is not designed for this type of feature, but I'm thinking the frequency will be small enough not to be a problem; I'm estimating 50-200 emails per day.
I couldn't find anything on the actual limitations of DB Mail, and I'm wondering if anyone has tried something similar in the past, or if I could be pointed in the right direction to send these emails following best practice.
If we're talking hundreds here, you can definitely go ahead. Take a peek at the Database Mail MSDN page. The current design (i.e. anything post-SQL 2000) was built specifically for large, high-performance enterprise implementations. Built on top of Service Broker (SQL Server's message queuing bus), it offers both asynchronous processing and scalability, with process isolation, clustering, and failover. One caveat is increased transaction log pressure, because messages, unlike in some other implementations, are ACID-protected by SQL Server, which in turn gives you full recoverability of the queues in case of failure.
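For reference, queueing a message is a single call to msdb.dbo.sp_send_dbmail. Here's a minimal sketch of calling it from Python via pyodbc; the connection string and the profile name are assumptions, not taken from your setup:

```python
# Minimal sketch: queue an email through Database Mail from Python.
# Requires pyodbc, a configured Database Mail profile, and permission to
# run sp_send_dbmail; the connection string and "NotificationsProfile"
# profile name are assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=msdb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
# sp_send_dbmail only enqueues the message; Service Broker delivers it
# asynchronously, so this call returns quickly.
cursor.execute(
    "EXEC msdb.dbo.sp_send_dbmail "
    "@profile_name = ?, @recipients = ?, @subject = ?, @body = ?",
    "NotificationsProfile",
    "client@example.com",
    "Work order updated",
    "Work order 1234 was changed today.",
)
conn.commit()
conn.close()
```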
If you're wondering what Service Broker can handle before migrating to a dedicated solution, there's a great MySpace case study. The most interesting fragment:
"We didn't want to start down the road of using Service Broker unless we could demonstrate that it could handle the levels of messages that we needed to support our millions of users across 440 database servers," says Stelzmuller. "When we went to the lab we brought our own workloads to ensure the quality of the testing. We needed to see if Service Broker could handle loads of 4,000 messages per second. Our testing found it could handle more than 18,000 messages a second. We were delighted that we could build our solution using Service Broker, rather than creating a custom solution on our own."
With the recent release of MySQL Cluster, which includes the Memcached API and provides persistence for memcache, can anyone share the benefits of each and which one to choose?
My initial idea is to use Couchbase as a memcache layer between the app and the DB.
First, as a disclosure: I'm part of the MySQL Cluster team.
Speaking specifically for MySQL Cluster, there is a decent blog here that takes you through the design rationale, benefits and implementation:
http://www.clusterdb.com/mysql-cluster/scalabale-persistent-ha-nosql-memcache-storage-using-mysql-cluster/
There is also a webinar recorded last month that demonstrates how to build a social networking type service with Memcached API and MySQL Cluster (registration required):
http://www.mysql.com/news-and-events/on-demand-webinars/display-od-723.html
I don't have any specific comparisons with Couchbase, but I hope the resources above will be useful in helping you determine if MySQL Cluster is right for your project.
Recently I had to answer the same question for myself. I wanted to pick a DB for a social application that works intensively with a lot of data and many users. As you can see, my needs were not expected to surface any big technical differences between these DBs; after all, both are used quite often for such applications.
Anyway, I prepared some load tests based on the scenarios I was interested in, and in the end I ran the tests continuously for two days on strong hardware. I would say that, more or less, both products showed the same amazing performance. I ran some crash tests too: dumb cases where I tried to eat all the resources and do "stupid" things. On these tests I had lower expectations for Couchbase, but it surprised me quite positively and actually did a better job than MongoDB. So, for the needs I had, from a technical and architectural perspective I did not find many differences between Couchbase and MySQL Cluster. My advice is to run some tests based on your scenarios and see for yourself.
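To give an idea of what I mean by scenario tests, here's a trivial sketch using pymemcache against any memcached-compatible endpoint (both products expose one); the host, port, and counts are placeholders, and a real test should mirror your own read/write mix and payload sizes:

```python
# Trivial load-test sketch against a memcached-compatible endpoint
# (both Couchbase and MySQL Cluster's Memcached API speak this protocol).
# Host, port, and operation counts are placeholders.
import time
from pymemcache.client.base import Client  # pip install pymemcache

client = Client(("localhost", 11211))

start = time.time()
for i in range(100_000):
    client.set(f"user:{i}", b"profile-blob")
    client.get(f"user:{i}")
elapsed = time.time() - start

# Two operations per loop iteration.
print(f"{200_000 / elapsed:.0f} ops/sec")
```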
In my case, the choice in the end was based on the license agreement, the statements in the legal notice, the pricing, and the community around the product; as a result, I picked MySQL Cluster.
I have a Python application where I want to start doing more work in the background so that it will scale better as it gets busier. In the past I have used Celery for doing normal background tasks, and this has worked well.
The only difference between this application and the others I have done in the past is that I need to guarantee that these messages are processed; they can't be lost.
For this application I'm not too concerned about the speed of my message queue; I need reliability and durability first and foremost. To be safe, I want to have two queue servers, both in different data centers in case something goes wrong, one a backup of the other.
Looking at Celery, it appears to support a bunch of different backends, some with more features than others. The two most popular look like Redis and RabbitMQ, so I took some time to examine them further.
RabbitMQ:
Supports durable queues and clustering, but the problem with the way clustering works today is that if you lose a node in the cluster, all messages on that node are unavailable until you bring the node back online. It doesn't replicate the messages between the different nodes in the cluster; it just replicates the metadata about the message and then goes back to the originating node to fetch it. If that node isn't running, you are S.O.L. Not ideal.
The way they recommend getting around this is to set up a second server, replicate the file system using DRBD, and then run something like Pacemaker to switch clients to the backup server when needed. This seems pretty complicated; I'm not sure if there is a better way. Anyone know of one?
Redis:
Supports a read slave, which would allow me to have a backup in case of emergencies, but it doesn't support a master-master setup, and I'm not sure whether it handles active failover between master and slave. It doesn't have the same features as RabbitMQ, but it looks much easier to set up and maintain.
Questions:
What is the best way to set up Celery so that it will guarantee message processing?
Has anyone done this before? If so, would you mind sharing what you did?
A lot has changed since the OP! There is now an option for high-availability aka "mirrored" queues. This goes pretty far toward solving the problem you described. See http://www.rabbitmq.com/ha.html.
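One note: HA policies are applied out of band (with rabbitmqctl or the management HTTP API), not from the client connection. A rough sketch against the management API, where the host, credentials, and queue-name pattern are all assumptions:

```python
# Sketch: apply an HA ("mirrored queue") policy through RabbitMQ's
# management HTTP API. Host, credentials, vhost, and the queue-name
# pattern are assumptions; rabbitmqctl set_policy does the same job.
import base64
import json
import urllib.request

url = "http://localhost:15672/api/policies/%2F/ha-all"  # %2F = default vhost
payload = json.dumps({
    "pattern": "^celery",             # mirror queues whose names start with "celery"
    "definition": {"ha-mode": "all"}, # mirror onto every node in the cluster
}).encode()

req = urllib.request.Request(url, data=payload, method="PUT")
req.add_header("Content-Type", "application/json")
req.add_header("Authorization",
               "Basic " + base64.b64encode(b"guest:guest").decode())
urllib.request.urlopen(req)  # 201/204 on success
```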
You might want to check out IronMQ; it covers your requirements (durable, highly available, etc.) and is a cloud-native solution, so there's zero maintenance. And there's a Celery broker for it: https://github.com/iron-io/iron_celery, so you can start using it just by changing your Celery config.
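If I remember the iron_celery README correctly, the switch is roughly the config change below; treat the exact URL scheme and credential placement as assumptions and verify them against the linked repo:

```python
# Celery config sketch for the IronMQ broker via the iron_celery package.
# The "ironmq://" scheme and credential placement are from memory of the
# project's README; verify against https://github.com/iron-io/iron_celery.
import iron_celery  # noqa: F401  (importing registers the ironmq transport)

BROKER_URL = "ironmq://YOUR_PROJECT_ID:YOUR_TOKEN@"
```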
I suspect that Celery bound to existing backends is the wrong solution for the reliability guarantees you need.
Given that you want a distributed queueing system with strong durability and reliability guarantees, I'd start by looking for such a system (they do exist) and then figuring out the best way to bind to it in Python. That may be via Celery & a new backend, or not.
I've used Amazon SQS for this purpose and got good results. You will keep receiving a message until you delete it from the queue, and it lets your app grow as much as you need.
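A minimal boto3 sketch of that at-least-once behavior follows: a message reappears after its visibility timeout unless you explicitly delete it. The region, queue URL, and handler are placeholders:

```python
# Minimal sketch of SQS's at-least-once delivery with boto3.
# Region, queue URL, and the handler are placeholders.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/work-orders"

def handle(body):
    print("processing", body)  # stand-in for real work

resp = sqs.receive_message(QueueUrl=queue_url,
                           MaxNumberOfMessages=1,
                           WaitTimeSeconds=10)  # long poll
for msg in resp.get("Messages", []):
    handle(msg["Body"])
    # Delete only after successful processing; a crash before this line
    # means the message becomes visible again and is retried.
    sqs.delete_message(QueueUrl=queue_url,
                       ReceiptHandle=msg["ReceiptHandle"])
```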
Is using a distributed rendering system an option? These are normally reserved for HPC, but a lot of the concepts are the same. Check out Qube or Deadline Render. There are other, open-source solutions as well. All have failover in mind, given the high degree of complexity and risk of failure in some renders, which can take hours per frame of an image sequence.
I am currently facing a problem for which I have not yet figured out a good solution, so I hope to get some advice from you all.
My problem is as in the picture:
The Core Database is where all the clients connect to manage live data; it is really, really big and busy all the time.
The Feature Database is not used as often, but it needs a small part of the live data (maybe 5%) from the Core Database, and the request tasks sent to this server take a long time and consume a lot of resources.
My current solution:
I use database replication between the Core Database and the Feature Database, and it works fine. But the problem is that I waste a lot of disk space storing unwanted data. (Filtering the data while replicating does not work with my database schema.)
Using a queueing system would not keep the data live in time, as there are many requests to the Core Database.
Please suggest some ideas if you have dealt with this before.
Thanks,
Pang
What you describe is a classic data integration task. You can use any data integration tool to extract data from your core database and load it into the feature database. You can schedule your data integration jobs anywhere from real time to any time frame you like.
I used Talend in my mid-size (10GB) semi-scientific PostgreSQL database integration project. It worked beautifully.
You can also try SQL Server Integration Services (SSIS). This tool is very powerful as well. It works with all top-notch RDBMSs.
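If a full ETL tool feels heavy for moving 5% of the data, the same extract-and-load pattern can also be a small scheduled script. A rough sketch with pyodbc, where the DSNs, table, and column names are all made up:

```python
# Rough sketch of a scheduled extract-and-load job: pull only the small
# slice of live rows the feature database needs and upsert them.
# DSNs, table, and column names are made up for illustration; the date
# functions below are SQL Server flavored and vary by RDBMS.
import pyodbc

core = pyodbc.connect("DSN=CoreDB")
feature = pyodbc.connect("DSN=FeatureDB")

rows = core.execute(
    "SELECT id, status, updated_at FROM work_orders "
    "WHERE updated_at > DATEADD(minute, -5, GETDATE())"  # recent changes only
).fetchall()

cur = feature.cursor()
for r in rows:
    # Naive delete-then-insert upsert; MERGE or INSERT ... ON CONFLICT
    # are alternatives depending on the target RDBMS.
    cur.execute("DELETE FROM work_orders WHERE id = ?", r.id)
    cur.execute(
        "INSERT INTO work_orders (id, status, updated_at) VALUES (?, ?, ?)",
        r.id, r.status, r.updated_at,
    )
feature.commit()
```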
If all you're worried about is disk space, I would stick with the solution you have right now. 100GB of disk space costs less than a dollar these days; for that money, you can't really afford to bring a new solution into the system.
Logically, there's also a case to be made for keeping the filtering in the same application: keeping the responsibility for knowing which records are relevant inside the app itself, rather than in some mysterious integration layer, will reduce overall solution complexity. Only accept the additional complexity of a special integration layer if you really need it.