Kafka has both partitioning and consumer groups

Can we use both partitioning and consumer groups for the same topic? We want to create a partitioned topic and attach multiple consumers to it: two or three generic consumers listening to all messages (these need to be in a consumer group so that a message is not processed multiple times), plus one consumer reading from a specific partition.
Are partitioning and consumer groups mutually exclusive?

With the high level consumer API you can't pin a consumer instance to a particular partition, but there is nothing preventing you from having one set of consumers using the high level API and another set of consumers using the simple API for the same topic.
With this you could have a simple consumer consuming from a specific partition and a set of high level consumers in a consumer group consuming messages across all partitions.

The high level consumer for Kafka 0.8.x doesn't allow you to specify a partition. It reads data from all partitions and does the complex failure detection and rebalancing for you. Manual partition assignment will probably be supported in future versions according to the API redesign - https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design#ConsumerClientRe-Design-Allowmanualpartitionassignment.
If you need to read from a specific topic and partition, use SimpleConsumer (it requires manual leader/offset/exclusivity/error handling).
Or you can use the high level consumer and filter the data by partition (if you accept the overhead).
Another option is to redesign your topics so that this data is written to a separate topic instead of a 'specific partition'.
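For reference, the Java consumer that shipped after 0.8.x supports both styles in one client library. A minimal sketch against a recent kafka-clients version; the topic name, group id, broker address and partition number are made up for illustration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Sketch: one consumer joins a group (partitions are shared across all members,
// so each message is handled once per group), while a second consumer is
// pinned to a single partition via manual assignment.
public class MixedConsumers {
    public static void main(String[] args) {
        Properties groupProps = new Properties();
        groupProps.put("bootstrap.servers", "localhost:9092");
        groupProps.put("group.id", "generic-processors");
        groupProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        groupProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Group member: the broker balances the topic's partitions across
        // every consumer that subscribed with the same group.id.
        KafkaConsumer<String, String> groupMember = new KafkaConsumer<>(groupProps);
        groupMember.subscribe(Collections.singletonList("events"));

        // Pinned consumer: no group management, reads exactly one partition.
        Properties pinnedProps = new Properties();
        pinnedProps.putAll(groupProps);
        pinnedProps.remove("group.id");
        KafkaConsumer<String, String> pinned = new KafkaConsumer<>(pinnedProps);
        pinned.assign(Collections.singletonList(new TopicPartition("events", 3)));

        ConsumerRecords<String, String> records = pinned.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            System.out.println("partition 3 only: " + record.value());
        }
        groupMember.close();
        pinned.close();
    }
}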


What are best practices for partitioning DocumentDB across accounts?

I am developing an application that uses DocumentDB to store customer data. One of the requirements is that we segregate customer data by geographic region, so that US customers' data is stored within the US, and European customers' data lives in Europe.
The way I planned to achieve this is to have two DocumentDB accounts, since an account is associated with a data centre/region. Each account would then have a database, and a collection within that database.
I've reviewed the DocumentDB documentation on client- and server-side partitioning (e.g. 1, 2), but it seems to me that the built-in partitioning support will not be able to deal with multiple regions. Even though an implementation of IPartitionResolver could conceivably return an arbitrary collection self-link, the partition map is associated with the DocumentClient and therefore tied to a specific account.
Therefore it appears I will need to create my own partitioning logic and maintain two separate DocumentClient instances - one for the US account and one for the Europe account. Are there any other ways of achieving this requirement?
Azure's best practices guide on data partitioning says:
All databases are created in the context of a DocumentDB account. A
single DocumentDB account can contain several databases, and it
specifies in which region the databases are created. Each DocumentDB
account also enforces its own access control. You can use DocumentDB
accounts to geo-locate shards (collections within databases) close to
the users who need to access them, and enforce restrictions so that
only those users can connect to them.
So, if your intention is to keep the data near the user (and not just keep it stored separately), your only option is to create different accounts. Luckily, billing is not per account but per collection.
DocumentDB's resource model gives the impression that you cannot (at least out of the box) mix DocumentDB accounts. Partition keys don't look useful here either, as partitioning can only happen within a single account.
Maybe this sample will help you or give some hints.
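If you do end up holding one DocumentClient per account, the routing layer on top can stay very thin. A minimal sketch of that idea; the router class, the region enum and the lookup are illustrative and not part of the DocumentDB SDK:

import java.util.Map;

// Sketch: keep one client per DocumentDB account (one per region) and pick
// the right one per customer. "C" stands in for the SDK's DocumentClient type;
// how the clients are constructed and how a customer's region is known are
// assumptions outside this sketch.
public class RegionalClientRouter<C> {

    public enum Region { US, EUROPE }

    private final Map<Region, C> clientsByRegion;

    public RegionalClientRouter(Map<Region, C> clientsByRegion) {
        this.clientsByRegion = clientsByRegion;
    }

    // Returns the client for the account that hosts this customer's data.
    public C clientFor(Region customerRegion) {
        C client = clientsByRegion.get(customerRegion);
        if (client == null) {
            throw new IllegalArgumentException("No account configured for " + customerRegion);
        }
        return client;
    }
}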

Technology stack for a multiple queue system

I'll describe the application I'm trying to build and the technology stack I'm considering at the moment, so I can get your opinion.
Users should be able to work on a list of tasks. These tasks come from an API with all the information about them: id, image URLs, description, etc. The API is only available in one datacenter, so to avoid that latency (for example in China) the tasks are stored in a queue.
So you'll have different queues depending on your country, and once you finish your task it will be sent to another queue, which will later write this information back to the original datacenter.
The list of tasks is quite large; that's why there is an API call to get the tasks (~10k rows) and store them in a queue, and users work on them from the queue for whichever country they are in.
For this system, where you can have around 100 queues, I was thinking of Redis to manage the task-list requests (e.g. get 5k rows from the China queue, write 500 rows to the write queue, etc.).
The API responses come as a list of JSON objects, and these ~10k rows need to be stored somewhere. Since you need to be able to filter within this queue, MySQL isn't an option unless I store every field of the JSON object as a new row. The first thought is a NoSQL DB, but I wasn't too happy with MongoDB in the past, and the API response format doesn't change much. Since I also need relational tables for other things, I was thinking of PostgreSQL: it's a relational database and it can store JSON and filter by it.
What do you think? Ask me if something isn't clear.
You can use the HStore extension from PostgreSQL to store JSON-like key/value data, or dynamic columns from MariaDB (a MySQL fork).
If you can move your persistence stack to Java, then many interesting options are available: MapDB (but it requires memory and its API is changing rapidly), Persistit, or MVStore (the engine behind H2).
All of these would let you store JSON with decent performance. I suggest you use a full-text search engine like Lucene to avoid searching the JSON content in a slow way.
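Since the question is specifically about storing JSON and filtering on it, a minimal PostgreSQL sketch via JDBC; the tasks table, payload column, connection URL and field names are made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: store each API response row in a jsonb column and filter on a field.
// Table assumed: CREATE TABLE tasks (id serial PRIMARY KEY, payload jsonb);
public class JsonbQueueExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/tasks_db", "app", "secret")) {

            // Insert one task as raw JSON (cast to jsonb on the server side).
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO tasks (payload) VALUES (?::jsonb)")) {
                insert.setString(1, "{\"country\":\"CN\",\"description\":\"label image\"}");
                insert.executeUpdate();
            }

            // Filter tasks by a JSON field, e.g. all tasks for the China queue.
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT id, payload->>'description' FROM tasks "
                    + "WHERE payload->>'country' = ? LIMIT 5000")) {
                select.setString(1, "CN");
                try (ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1) + ": " + rs.getString(2));
                    }
                }
            }
        }
    }
}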

Which strategy to use for designing a log data storage?

We want to design a data store with a relational database to keep the request message (HTTP/S, XMPP, etc.) logs. For generating the logs we use a solution based on the Apache Synapse ESB. Since we want to store the logs and read them only for maintenance issues, the read/write ratio will be low (writes will be intensive, since the system will receive many messages to be logged). We thought of using Cassandra for its distributed nature and clustering capabilities. However, with Cassandra schemas, filtered search queries are difficult and always require secondary indexes.
To keep it short, my question is: should we try MySQL's clustering solutions, or use Cassandra with a schema designed for filtered search queries?
If you wish to do real-time analytics over your semi-structured or unstructured data, you can go with a Cassandra + Hadoop cluster; the Cassandra wiki itself suggests the DataStax Brisk edition for that kind of architecture. It is worth giving it a try.
On the other hand, if you wish to do real-time queries over raw logs for a small set of data, e.g.
select useragent from raw_log_table where id='xxx'
then you should do a lot of research on your row key and column key design, because that decides the complexity of the query. Have a look at the case studies here: http://www.datastax.com/cassandrausers1
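To make the point-lookup case concrete, a minimal sketch with the DataStax Java driver; the keyspace, table layout and contact point are assumptions chosen to match the query above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Sketch: a log table whose partition key is the lookup field (id), so the
// "select useragent ... where id='xxx'" query above is a single-partition read
// and needs no secondary index. Keyspace, table and host names are made up.
public class RawLogLookup {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS logs "
                    + "WITH replication = {'class':'SimpleStrategy','replication_factor':1}");
            session.execute("CREATE TABLE IF NOT EXISTS logs.raw_log_table ("
                    + "id text PRIMARY KEY, useragent text, body text)");

            session.execute("INSERT INTO logs.raw_log_table (id, useragent, body) "
                    + "VALUES ('xxx', 'curl/7.68.0', 'raw request payload')");

            ResultSet rs = session.execute(
                    "SELECT useragent FROM logs.raw_log_table WHERE id = 'xxx'");
            for (Row row : rs) {
                System.out.println(row.getString("useragent"));
            }
        }
    }
}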

Implementing message priority in AMQP

I'm intending to use AMQP to allow a distributed collection of machines to report to a central location asynchronously. The idea is to drop messages into the queue and allow the central logging entity to process the queue in a decoupled fashion; the 'process' is simply to create or update a row in a database table.
A problem that I'm anticipating is the effect of network jitter in the message queuing process - what happens if an update accidentally gets in front of an insert because the time between the two messages being issued is less than the network jitter?
Reading the AMQP spec, it seems that I could just apply a higher priority to inserts so they skip the queue and get processed first. But presumably this only applies if a queue actually exists at the broker to be skipped. Is there a way to impose a buffer or delay at the broker to absorb this jitter and allow priority to be enacted before the messages are passed on to the consumer(s)?
Or do I have to go down the route of a resequencer as ActiveMQ suggests?
The lack of ordering between multiple publishers has nothing to do with network jitter; it's a completely natural thing in distributed applications. Messages from the same publisher will always be ordered. If you really need causal ordering of actions performed by different nodes, then a resequencer or a global sequence numbering scheme are your only options. Note that you cannot use sender timestamps for this, which is what everyone seems to try first.
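To illustrate the resequencer option, a minimal consumer-side sketch; it assumes every message already carries a gap-free global sequence number assigned at publish time, which is the hard part and is not shown here:

import java.util.PriorityQueue;
import java.util.function.Consumer;

// Sketch of a consumer-side resequencer: buffer out-of-order messages and
// release them strictly in global sequence order.
public class Resequencer<T> {

    static final class Sequenced<T> {
        final long seq;
        final T payload;
        Sequenced(long seq, T payload) { this.seq = seq; this.payload = payload; }
    }

    private final PriorityQueue<Sequenced<T>> buffer =
            new PriorityQueue<>((a, b) -> Long.compare(a.seq, b.seq));
    private long nextExpected;
    private final Consumer<T> downstream;

    public Resequencer(long firstSeq, Consumer<T> downstream) {
        this.nextExpected = firstSeq;
        this.downstream = downstream;
    }

    public synchronized void accept(long seq, T payload) {
        buffer.add(new Sequenced<>(seq, payload));
        // Release every buffered message whose turn has come, in order.
        while (!buffer.isEmpty() && buffer.peek().seq == nextExpected) {
            downstream.accept(buffer.poll().payload);
            nextExpected++;
        }
    }
}

With this in place, the insert (which got the lower sequence number) is handed to the database writer before the update, regardless of arrival order.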

Using SQL 2008 ServiceBroker for high-volume threadsafe FIFO queue

I'm just starting to evaluate ServiceBroker to determine if it can perform as a reliable queue in a very specific context. Here is the scenario:
(1) need to pre-calculate a large (several million) population of computationally expensive values and store in a queue.
(2) multiple processes will attempt to read/dequeue these values at run time on an as-needed basis; this could be several hundred+ reads per second.
(3) a monitor process will occasionally poll the queue and determine if the population minimum threshold has been reached, and will then re-populate the queue.
Due to some infrastructure/cost constraints, an industrial-strength queue (WebSphere) might not be an option. What I have seen thus far of Service Broker is not encouraging, because it seems to be isolated to a "conversation" with two endpoints, and in my scenario my reads happen completely independently of my writes. Does anyone have any insight as to whether this is possible with SQL Service Broker?
Although Service Broker has not been designed for such scenarios, I think with a little tweaking it could work in your case.
One approach would be to pre-create a pool of conversations and then have the calculating process round-robin between these conversations when storing values. Since receiving from a queue takes a lock on the conversation, the number of conversations essentially sets an upper bound on how many processes may dequeue values concurrently. I'm not sure about this, but you may need some logic on the receiver side to state explicitly which conversation to receive from (in order to achieve better load balancing than the default receive behavior would give you).
If perf is not a concern, then you may even drop the conversation pool idea and send each message on a separate dialog, which would make the implementation much simpler at the cost of a significant perf hit.
All of the above assumes the values may be dequeued in random order; otherwise you need to guarantee the receive ordering by using a single conversation.
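A minimal sketch of the round-robin idea; the UUID handle type and the sendOnConversation callback are placeholders for however you actually issue SEND ON CONVERSATION (e.g. over JDBC), not Service Broker API:

import java.util.List;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

// Sketch: spread writes across a fixed pool of pre-created conversation
// handles. The pool size bounds how many readers can dequeue concurrently,
// since RECEIVE locks one conversation at a time.
public class ConversationPoolWriter {

    private final List<UUID> conversationHandles;              // pre-created dialogs
    private final AtomicLong counter = new AtomicLong();
    private final BiConsumer<UUID, String> sendOnConversation; // e.g. runs SEND ON CONVERSATION via JDBC

    public ConversationPoolWriter(List<UUID> conversationHandles,
                                  BiConsumer<UUID, String> sendOnConversation) {
        this.conversationHandles = conversationHandles;
        this.sendOnConversation = sendOnConversation;
    }

    // Round-robin the pre-calculated value onto the next conversation in the pool.
    public void enqueue(String precomputedValue) {
        int index = (int) (counter.getAndIncrement() % conversationHandles.size());
        sendOnConversation.accept(conversationHandles.get(index), precomputedValue);
    }
}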