I was wondering if you can help me decide which one is best suited to my case.
Use case:
I want to batch process, per day, ~200M events that are stored in Apache Kafka and ~20M rows in different SQL tables. The data in the rows represents the states of users, and the events in Kafka change these states. The events in Kafka are well partitioned (all events for one user are stored in exactly one Kafka partition), but there are still more users than Kafka partitions.
(EDIT)
State updates can't be handled in real time because events come from different sources at different times. All events have timestamps with proper timezones, but events may be observed late, which results in shifted timestamps. There are business rules for how to handle these.
I know how to compute the user state for any given time if all events and the starting state are available.
Output:
consistent final user states are stored in MySQL
writes to other sources (Kafka, text files, etc.) can occur during computation, based on the current state
All of them are able to read and group data so I can process it, but as far as I know:
Spark and Flink can work without Hadoop (so far I don't have any stable cluster)
Spark has problems dealing with more data than the available RAM (?)
with Flink I'm not sure if I can combine data from a data stream (Kafka) and a table (SQL)
with M/R I need to set up a Hadoop cluster
Also, in the future there might be 100M events per hour, and there will be a functional Hadoop cluster.
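To make the use case more concrete, here is a minimal sketch of the Spark batch variant I have in mind, assuming the Spark Kafka and MySQL JDBC connectors are on the classpath; the topic, tables and connection details are placeholders, and the actual state-folding logic is left out:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class UserStateBatchJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("user-state-batch")
                    .getOrCreate();

            // Bounded (batch) read of the Kafka topic between explicit offsets.
            Dataset<Row> events = spark.read()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
                    .option("subscribe", "user-events")                 // placeholder topic
                    .option("startingOffsets", "earliest")
                    .option("endingOffsets", "latest")
                    .load()
                    .selectExpr("CAST(key AS STRING) AS user_id",
                                "CAST(value AS STRING) AS event_json",
                                "timestamp");

            // Current user states read directly from MySQL over JDBC.
            Dataset<Row> states = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:mysql://db-host:3306/app")     // placeholder
                    .option("dbtable", "user_state")                    // placeholder table
                    .option("user", "reader")
                    .option("password", "secret")
                    .load();

            // Join events with the starting state per user; the per-user fold
            // that applies the business rules for late events is omitted here.
            Dataset<Row> joined = events.join(states, "user_id");

            joined.write()
                    .format("jdbc")
                    .option("url", "jdbc:mysql://db-host:3306/app")
                    .option("dbtable", "user_state_new")                // placeholder output table
                    .option("user", "writer")
                    .option("password", "secret")
                    .mode("overwrite")
                    .save();

            spark.stop();
        }
    }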
Related
The problem arises when I already have a system and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought about dumping the data into HDFS.
Now, data is also coming into the DB every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources and do analytics in near real time (a 1-2 minute delay is ok), and save those results, because future data needs previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can use JDBC rather than needing HDFS exports.
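For example, a rough sketch of that JDBC path (the MySQL driver must be on the classpath; the connection details, table and partition column are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MysqlJdbcRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("mysql-over-jdbc")
                    .getOrCreate();

            // Read straight from MySQL instead of a stale HDFS dump.
            // partitionColumn/lowerBound/upperBound/numPartitions split the
            // table into parallel JDBC reads (column and bounds are placeholders).
            Dataset<Row> transactions = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:mysql://db-host:3306/app")
                    .option("dbtable", "transactions")
                    .option("user", "reporting")
                    .option("password", "secret")
                    .option("partitionColumn", "id")
                    .option("lowerBound", "1")
                    .option("upperBound", "50000000")
                    .option("numPartitions", "16")
                    .load();

            transactions.createOrReplaceTempView("transactions");
            spark.sql("SELECT COUNT(*) FROM transactions").show();
        }
    }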
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, then that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (it could take more than a few minutes).
However, if you use Debezium to get data into Kafka from MySQL, then you have the data centralized in one location, and you can then ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases / query patterns.
Dealing with large MySQL DBs (tens of TB), I find myself having to split things up across multiple connectors so I can process more rows at a time (each connector processes one table at a time).
Once the initial sync is complete and it switches to incremental streaming, what is the cleanest way of merging the two connectors?
Is it even possible?
Since both connectors are created with different database.server.name values, their associated topics are likely prefixed with different values and so attempting to merge multiple connectors would be highly cumbersome and error-prone.
What I would suggest is: if you have a large volume of data that you need to snapshot, don't rely on the initial snapshot phase to capture it. Instead, configure a single connector to use the schema_only snapshot mode so that the schemas get captured before streaming starts. Then you can leverage incremental snapshots, which run in parallel with streaming, to capture the 10 TB of data concurrently.
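As a sketch of what that single-connector setup could look like, here is an illustrative configuration written as Java properties rather than the usual Connect REST JSON; hostnames, credentials, database and topic names are placeholders, and the property names are as I understand the Debezium MySQL connector:

    import java.util.Properties;

    public class DebeziumMysqlConfig {
        public static Properties connectorConfig() {
            Properties p = new Properties();
            p.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
            p.setProperty("database.hostname", "mysql-host");        // placeholder
            p.setProperty("database.port", "3306");
            p.setProperty("database.user", "debezium");
            p.setProperty("database.password", "secret");
            p.setProperty("database.server.id", "184054");
            p.setProperty("database.server.name", "bigdb");          // single topic prefix
            p.setProperty("database.include.list", "app");           // all tables in one connector
            // Capture only the table schemas up front, then start streaming the binlog.
            p.setProperty("snapshot.mode", "schema_only");
            // Signalling table used to trigger incremental snapshots while streaming.
            p.setProperty("signal.data.collection", "app.debezium_signal");
            p.setProperty("database.history.kafka.bootstrap.servers", "broker:9092");
            p.setProperty("database.history.kafka.topic", "schema-changes.bigdb");
            return p;
        }
    }

The incremental snapshot of the existing rows is then kicked off while the connector is streaming, by inserting an execute-snapshot signal row into the configured signalling table, and it runs chunk by chunk in parallel with the binlog capture.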
In MySQL, using the binlog, we can extract the data changes. But I need only the latest changes made during that time / day, and I need to feed that data into a time-series DB (planning to go with Druid).
While reading the binlog, is there any mechanism to avoid duplicates and keep only the latest changes?
My intention is to get the entire MySQL DB backed up every day into a time-series DB. It helps me debug my application for past dates by referring to the actual data present on that day.
Kafka, by design, is an append-only log (no updates).
A Kafka Connect source connector will continuously capture all the changes from the binlog into a Kafka topic. The connector stores its position in the binlog and will only write new changes into Kafka as they become available in MySQL.
For consuming from Kafka, as one option, you can use a sink connector that writes all the changes to your target. Or, instead of a Kafka Connect sink connector, some independent process that reads (consumes) from Kafka. For Druid specifically, you may look at https://www.confluent.io/hub/imply/druid-kafka-indexing-service.
The consumer (either a connector or some independent process) stores its position (offset) in the Kafka topic and will only write new changes into the target (Druid) as they become available in Kafka.
The process described above captures all the changes and allows you to view the source (MySQL) data at any point in time in the target (Druid). It is best practice to have all the changes available in the target. Use your target's functionality to limit the view of the data to a certain time of day, if needed.
If, for example, there is a huge number of daily changes to a record in MySQL and you'd like to write only the latest status as of a specific time of day to the target, you'll still need to read all the changes from MySQL. Create an additional daily process that reads all the changes since the prior run, filters out everything but the latest records, and writes them to the target.
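A minimal sketch of such a daily pass, assuming the change events are keyed by the row's primary key; the topic, brokers and group id are placeholders and the stop condition is simplified:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class DailyLatestChangeJob {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");            // placeholder
            props.put("group.id", "daily-latest-job");                // committed offsets mark "since prior run"
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            props.put("enable.auto.commit", "false");
            props.put("auto.offset.reset", "earliest");

            Map<String, String> latestByKey = new HashMap<>();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("bigdb.app.orders"));      // placeholder change topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                    if (records.isEmpty()) {
                        break;                                        // caught up for this run (simplified)
                    }
                    for (ConsumerRecord<String, String> record : records) {
                        // Later changes overwrite earlier ones, so only the
                        // latest change per primary key survives.
                        latestByKey.put(record.key(), record.value());
                    }
                }
                // Write latestByKey to the target (Druid ingestion spec, file, etc.),
                // then commit so the next run only sees newer changes.
                consumer.commitSync();
            }
        }
    }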
My app would be consuming data from multiple APIs. This data can either be a single event or a batch of events. The data I am dealing with is click streams: my app would run a cron job every minute to fetch data from our partners using their APIs and eventually save everything to MySQL for detailed analysis. I am looking for ways to buffer this data somewhere and then batch-insert it into MySQL.
For example, say I receive a batch of 1000 click events with one API call: what data structures can I use to buffer them in Redis, and then eventually have a worker process consume this data and insert it into MySQL?
One simple approach would be to just fetch the data and store it in MySQL as-is. But since I am dealing with ad tech, where the size and velocity of the data are always subject to change, this hardly seems like an approach to start with.
Oh, and the app would be built on top of Spring Boot and Tomcat.
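For what it's worth, this is roughly the shape I have in mind: a Redis list as the buffer and a worker that drains it into MySQL with a JDBC batch insert. It assumes Jedis and plain JDBC, and the key name, table and connection details are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import redis.clients.jedis.Jedis;

    public class ClickBuffer {

        private static final String QUEUE_KEY = "clicks:pending";    // placeholder key

        // Called by the cron job for every click event fetched from a partner API.
        public static void buffer(Jedis jedis, String clickJson) {
            jedis.rpush(QUEUE_KEY, clickJson);                        // append to the tail of the list
        }

        // Worker: drain up to batchSize events and insert them in one JDBC batch.
        public static void drain(Jedis jedis, int batchSize) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://db-host:3306/ads", "worker", "secret");   // placeholders
                 PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO clicks (payload) VALUES (?)")) {           // placeholder table

                int buffered = 0;
                for (int i = 0; i < batchSize; i++) {
                    String clickJson = jedis.lpop(QUEUE_KEY);         // FIFO: pop from the head
                    if (clickJson == null) {
                        break;                                        // queue drained
                    }
                    ps.setString(1, clickJson);
                    ps.addBatch();
                    buffered++;
                }
                if (buffered > 0) {
                    ps.executeBatch();
                }
            }
        }
    }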
Any help/discussion would be greatly appreciated. Thank you!
I have 150 GB of MySQL data and plan to replace MySQL with Cassandra as the backend.
For analytics, I plan to go with Hadoop, Hive or HBase.
Currently I have 4 physical machines for a POC. Please, can someone help me come up with the most efficient architecture?
Per day I will get 5 GB of data.
I have to send a daily status report to each customer.
I have to provide analysis reports on request: for example, a 1-week report, or a report for the first 2 weeks of last month. Is it possible to produce reports instantly using Hive or HBase?
I want to get the best performance using Cassandra and Hadoop.
Hadoop can process your data using the MapReduce paradigm or emerging technologies such as Spark. The advantage is a reliable distributed filesystem and the use of data locality to send the computation to the nodes that hold the data.
Hive is a good SQL-like way of processing files and generating your reports once a day. It's batch processing, and 5 more GB a day shouldn't have a big impact. It has a high overhead latency, though, but that shouldn't be a problem if you run it once a day.
HBase and Cassandra are NoSQL databases whose purpose is to serve data with low latency. If that's a requirement, you should go with one of them. HBase stores its data on HDFS, and Cassandra has good connectors to Hadoop, so it's simple to run jobs consuming from either of these two sources.
For reports based on a request that specifies a date range, you should store the data in an efficient layout so you don't have to ingest data that isn't needed for your report. Hive supports partitioning, and that can be done by date (i.e. /<year>/<month>/<day>/). Using partitioning can significantly reduce your job execution times.
If you go with the NoSQL approach, make sure the rowkeys have some date format as a prefix (e.g. 20140521...) so that you can select the rows that start with the dates you want.
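For example, with the HBase Java client a date-range request then becomes a simple start/stop rowkey scan; the table name and dates below are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DateRangeScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("events"))) {   // placeholder table

                // Rowkeys are prefixed with yyyyMMdd, so a one-week report is
                // a scan from the start date to the (exclusive) stop date.
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes("20140514"));
                scan.setStopRow(Bytes.toBytes("20140521"));

                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result result : scanner) {
                        System.out.println(Bytes.toString(result.getRow()));
                    }
                }
            }
        }
    }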
Some questions you should also consider are:
how much data do you want to store in your cluster – e.g. the last 180 days, etc. This will affect the number of nodes / disks. Beware that data is usually replicated 3 times (for example, 5 GB/day × 180 days × 3 replicas ≈ 2.7 TB of raw capacity).
how many files do you have in HDFS – when the number of files is high, the Namenode is hit hard on retrieval of file metadata. Some solutions exist, such as a replicated Namenode or the MapR Hadoop distribution, which doesn't rely on a Namenode per se.