I am using Kafka and ZooKeeper as the main components of my data pipeline, which processes thousands of requests each second. I am using Samza as the real-time data processing tool for the small transformations I need to make on the data.
My problem is that one of my consumers (let's say ConsumerA) consumes several topics from Kafka and processes them, basically creating a summary of the topics that are digested. I further want to push this data to Kafka as a separate topic, but that forms a loop between Kafka and my component.
This is what bothers me: is this a desirable architecture in Kafka?
Should I instead do all the processing in Samza and store only the digested (summary) information in Kafka from Samza? The amount of processing I am going to do is quite heavy, though, which is why I want to use a separate component for it (ComponentA). I guess my question can be generalized to all kinds of data pipelines.
So is it good practice for a component to be both a consumer and a producer in a data pipeline?
As long as Samza is writing to different topics than it is consuming from, no, there will be no problem. Samza jobs that read from and write to Kafka are the norm and intended by the architecture. One can also have Samza jobs that bring some data in from another system, or jobs that write some data from Kafka out to a different system (or even jobs that don't use Kafka at all).
Having a job read from and write to the same topic, however, is where you'd get a loop, and that is to be avoided: it has the potential to fill up your Kafka brokers' disks very fast.
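For illustration, here is a minimal consume-transform-produce loop using the plain Kafka Java clients; topic names and the summarize step are placeholders, not from the question, and a Samza job would give you the same read/write pattern plus state management and checkpointing:

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class SummaryPipeline {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "consumer-a");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // Read from the raw topics...
            consumer.subscribe(Arrays.asList("topic-a", "topic-b"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String summary = summarize(record.value()); // placeholder transformation
                    // ...and write the digest to a *different* topic, so no loop is formed.
                    producer.send(new ProducerRecord<>("summaries", record.key(), summary));
                }
            }
        }
    }

    private static String summarize(String raw) {
        return raw; // real aggregation logic goes here
    }
}
```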
Related
I am implementing a data aggregation application framework. It should consume create, update, and delete operations on MySQL databases that are sharded across multiple instances.
These aggregations should be stored in a separate database and serve as a read store for, e.g., statistical overviews.
I've come up with two solutions. One is using Kafka consumers to write this data to the aggregation database. The other is creating background jobs for Beanstalkd at the application level (whenever data is persisted) and consuming them with workers that would do the same job as the Kafka consumers.
I'm interested in your thoughts on the pros and cons of both approaches in terms of performance, scalability, and fault tolerance.
I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also just connect Kafka to the MySQL data source using Kafka Connect or Debezium. My target is to ingest the data using Kafka and send it to Storm to consume and analyze. It looks like both ways can achieve my target, but using a Kafka producer may require me to build a Kafka service that keeps reading the data source.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not re-inventing the wheel and using Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might be doing the same in your producer; it's not clear from the question). This provides many advantages over polling (a deployment sketch follows the list below):
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)
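As a rough sketch of the deployment, registering the Debezium MySQL connector is just a matter of POSTing a configuration to Kafka Connect's REST API; each captured table then becomes a Kafka topic that Storm can consume. The property names below follow the Debezium 1.x documentation and may differ in other versions, and all host names and credentials are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Debezium MySQL connector config; treat this as a sketch, not a drop-in file.
        String config = """
            {
              "name": "clicks-connector",
              "config": {
                "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                "database.hostname": "mysql-host",
                "database.port": "3306",
                "database.user": "debezium",
                "database.password": "secret",
                "database.server.id": "184054",
                "database.server.name": "clickdb",
                "table.include.list": "clickstream.page_clicks",
                "database.history.kafka.bootstrap.servers": "kafka:9092",
                "database.history.kafka.topic": "schema-changes.clickdb"
              }
            }
            """;

        // Register the connector with a running Kafka Connect worker.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect-host:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```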
I'll describe the application I'm trying to build and the technology stack I'm currently considering, to get your opinion.
Users should be able to work on a list of tasks. These tasks come from an API with all the information about them: id, image URLs, description, etc. The API is only available in one datacenter, and in order to avoid the delay, for example in China, the tasks are stored in a queue.
So you'll have different queues depending on your country, and once you finish with your task, it will be sent to another queue which will later write this information back to the original datacenter.
The list of tasks is quite large; that's why there is an API call to get the tasks (~10k rows) and store them in a queue, so users can work on them depending on the queue for the country they are in.
For this system, where you can have around 100 queues, I was thinking of Redis to manage the task list requests (e.g., get me 5k rows for the China queue, write 500 rows to the write queue, etc.).
The API responses come as a list of JSON objects. These 10k rows, for example, need to be stored somewhere. Since you need to be able to filter within this queue, MySQL isn't an option, unless I store every field of the JSON object as a new row. My first thought was a NoSQL DB, but I wasn't too happy with MongoDB in the past, and the API response doesn't change much. Since I also need relational tables for other things, I was thinking of PostgreSQL: it's a relational database, and you have the ability to store JSON and filter by it.
What do you think? Ask me if something isn't clear.
You can use the HStore extension in PostgreSQL to store JSON, or dynamic columns in MariaDB (a MySQL fork).
If you can move your persistence stack to Java, then many interesting options are available: MapDB (but it requires memory and its API is changing rapidly), Persistit, or MVStore (the storage engine behind H2).
All of these would allow you to store JSON with decent performance. I suggest you use a full-text search engine like Lucene to avoid searching JSON content slowly.
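For the PostgreSQL option mentioned in the question, here is a minimal JDBC sketch of storing the task JSON in a jsonb column and filtering on one of its fields; the table name, JSON fields, and connection details are illustrative assumptions:

```java
import java.sql.*;

public class TaskQueueStore {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/tasks"; // placeholder connection details
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS task_queue ("
                         + "  id      BIGSERIAL PRIMARY KEY,"
                         + "  country TEXT NOT NULL,"
                         + "  payload JSONB NOT NULL)");
                // An expression index lets you filter on a JSON field without a full scan.
                st.execute("CREATE INDEX IF NOT EXISTS idx_task_status "
                         + "ON task_queue ((payload->>'status'))");
            }

            // Insert one task coming from the API as raw JSON.
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO task_queue (country, payload) VALUES (?, ?::jsonb)")) {
                ins.setString(1, "CN");
                ins.setString(2, "{\"id\": 42, \"status\": \"open\", \"description\": \"label image\"}");
                ins.executeUpdate();
            }

            // Filter by a field inside the JSON payload (->> returns text).
            try (PreparedStatement q = conn.prepareStatement(
                    "SELECT payload FROM task_queue WHERE country = ? AND payload->>'status' = ? LIMIT 5000")) {
                q.setString(1, "CN");
                q.setString(2, "open");
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}
```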
http://pubapi.cryptsy.com/api.php?method=marketdatav2
I would like to synchronize market data on a continuous basis (e.g., Cryptsy and other exchanges), and to show the latest buy/sell prices from the respective orders on these exchanges on a regular basis, as a historical time series.
What backend database should I use to store the retrieved data and render or plot any of its parameters as a historical time series?
I'd suggest you look at a database tuned for handling time series data. The one that springs to mind is InfluxDB. This question has a more general take on time series databases.
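For what it's worth, here is a rough sketch of pushing one bid/ask sample into InfluxDB over its 1.x HTTP write endpoint using the line protocol; the database name, measurement, and tags are made up, and InfluxDB 2.x uses a different API with token auth:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TickWriter {
    public static void main(String[] args) throws Exception {
        // One point in line protocol: measurement,tags fields timestamp(ns)
        String point = "ticker,exchange=cryptsy,market=BTC_USD "
                     + "bid=430.12,ask=431.05 "
                     + System.currentTimeMillis() * 1_000_000L;

        // InfluxDB 1.x write endpoint; the "marketdata" database must exist already.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8086/write?db=marketdata&precision=ns"))
                .POST(HttpRequest.BodyPublishers.ofString(point))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode()); // 204 means the write was accepted
    }
}
```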
I think we need more detail about the requirement.
The question just says "it needs to sync time series data". What is the scenario? What are the data source and destination?
Option 1.
If it is just a data synchronization issue between two databases, the easiest solution is the CouchDB family of NoSQL databases (CouchDB, Couchbase, Cloudant).
They are all based on CouchDB, and they provide datacenter-level replication (XDCR). So you can replicate the data to another CouchDB instance in another datacenter, or even to a CouchDB on mobile devices.
I hope it will be useful to you.
Option 2.
The other approach is data integration: you can sync data using an ETL batch job, where a batch worker copies data to the destination periodically. This is the most common way to replicate data to another destination. There are a lot of tools that support ETL, such as Pentaho ETL, Spring Integration, and Apache Camel.
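For example, a minimal Apache Camel sketch of such a periodic batch copy; the timer interval is arbitrary and the copy step is a placeholder, and a real job would typically plug in Camel's SQL/JDBC components (or a Pentaho/Spring Integration equivalent) instead:

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class PeriodicSync {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Fire every 5 minutes and run the copy step.
                from("timer:sync?period=300000")
                        .log("starting sync run")
                        .process(exchange -> {
                            // Placeholder: read changed rows from the source DB
                            // and write them to the destination here.
                        });
            }
        });
        context.start();
        Thread.sleep(Long.MAX_VALUE); // keep the worker running
    }
}
```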
If you can provide a more detailed scenario, I can help in more detail.
Enjoy
-Terry
I think MongoDB is a good choice. Here is why:
You can easily scale out and thus store a tremendous amount of data. With a suitable shard key, you might even be able to position the shards close to the exchange they follow in order to improve speed, should that become a concern.
Replica sets offer automatic failover, which covers availability, something that could implicitly become an issue here.
Using the TTL feature, data can be automatically deleted after its TTL expires, effectively creating a round-robin database.
Both the aggregation framework and the map/reduce framework will be helpful.
There are free classes at MongoDB University which will help you avoid the most common pitfalls.
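As a small illustration of the TTL point, an index with expireAfter is all it takes to age out old samples automatically; the collection and field names are made up, and this assumes the current MongoDB Java driver:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class TickStore {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> ticks =
                    client.getDatabase("markets").getCollection("ticks");

            // TTL index: documents are removed automatically ~30 days after createdAt.
            ticks.createIndex(Indexes.ascending("createdAt"),
                    new IndexOptions().expireAfter(30L, TimeUnit.DAYS));

            // One bid/ask sample.
            ticks.insertOne(new Document("exchange", "cryptsy")
                    .append("market", "BTC_USD")
                    .append("bid", 430.12)
                    .append("ask", 431.05)
                    .append("createdAt", new Date()));
        }
    }
}
```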
I have a number of systems, most of which are capable of generating data using JSON Activity Streams[1] (or can be coerced into doing so), and I want to use this data for analytics.
I want to use both a traditional SQL datamart for OLAP use and also to dump the raw JSON data into Hadoop for running batch MapReduce jobs.
I've been reading up on Kafka, Flume, Scribe, S4, Storm, and a whole load of other tools, but I'm still not sure which is best suited to the task at hand. These seem to be focused either on logfile data or on real-time processing of the activity stream, whereas I guess I'm more interested in doing ETL on activity streams.
The type of setup I'm thinking of is one where I provide a configuration for all the streams I'm interested in (URLs, params, credentials), and the tool periodically polls them, dumps the output into HDFS, and also has a hook for me to process and transform the JSON for insertion into the datamart.
Do any of the existing open-source tools fit this case particularly well?
(In terms of scale I expect a max of 30,000 users interacting with ~10 systems - not simultaneously - so not really "Big Data", but not trivial either.)
Thanks!
[1] http://activitystrea.ms/
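To make it concrete, the kind of job I'd otherwise hand-roll is roughly the sketch below (the URL, HDFS path, and cluster address are placeholders, and credentials are omitted); I'd rather get this poll-and-dump behaviour from an existing tool than maintain it myself:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ActivityStreamDump {
    public static void main(String[] args) throws Exception {
        // Poll one activity stream endpoint (placeholder URL).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://system-a.example.com/activities?since=yesterday"))
                .GET()
                .build();
        String json = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();

        // Dump the raw JSON into HDFS for later MapReduce jobs.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(
                     new Path("/raw/activities/system-a/" + System.currentTimeMillis() + ".json"))) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        // A second hook would transform the same JSON and load it into the datamart.
    }
}
```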
You should check out streamsets.com
It's an open source tool (and free to use) built exactly for these kinds of use cases.
You can use the HTTP Client source and HDFS destination to achieve your main goals. If you decide you need to use Kafka or Flume as well, support for both is also built in.
You can also do transformations in a variety of ways, including writing Python or JavaScript for more complex transformations (or your own stage in Java if you choose).
You can also check out Logstash (elastic.co) and NiFi to see if one of those works better for you.
*Full disclosure: I'm an engineer at StreamSets.