Flume Spark Streaming into MySQL Database

I have a single-node Spark setup and I'm streaming data into MySQL using Apache Flume -> Spark -> MySQL.
I used the foreachPartition method, but data insertion is getting slow and a queue is building up in the Spark UI.
Are there any suggestions to improve this data insertion? I need to process 3000 rows per second.

Please provide the configuration details for your query.
Are you using bulk inserts for MySQL?
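If not, batching the writes is usually the first fix. Below is a minimal sketch of batched inserts inside foreachPartition, assuming a hypothetical events(id, payload) table and the mysql-connector-python driver; adapt the schema, credentials, and batch size to your setup.

import mysql.connector

def insert_partition(rows):
    # One connection per partition, not per row.
    conn = mysql.connector.connect(
        host="localhost", user="spark", password="secret", database="streaming")
    cursor = conn.cursor()
    batch = []
    for row in rows:
        batch.append((row.id, row.payload))
        if len(batch) >= 1000:  # flush in batches instead of row by row
            cursor.executemany(
                "INSERT INTO events (id, payload) VALUES (%s, %s)", batch)
            conn.commit()
            batch = []
    if batch:  # flush the remainder
        cursor.executemany(
            "INSERT INTO events (id, payload) VALUES (%s, %s)", batch)
        conn.commit()
    cursor.close()
    conn.close()

# In the streaming job:
# dstream.foreachRDD(lambda rdd: rdd.foreachPartition(insert_partition))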

Related

Syncing data between MySQL and Node.js

In my project, I am automatically storing a lot of data in MySQL. However, I want the same data to be synced to a Node.js application's own data storage. What would be the fastest and easiest way to do this when writing to both storages simultaneously isn't possible?
For example, I am storing a variable "balance" in MySQL inside one Node.js application. I want this same balance updated in another Node.js application's own storage, but my current Node.js app is not connected to a socket or any other data-transport mechanism. So how could I fetch the data from MySQL in that other Node.js application?
It sounds like your project structure fits the saga pattern.
For your question about updating data: you can use Kafka. Create a topic and have both Node.js applications consume messages on that same topic to update the data in their own storage.
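A minimal sketch of that idea using the kafka-python client (a Node.js client such as kafkajs follows the same shape); the topic name balance-updates and the message format are assumptions:

import json
from kafka import KafkaConsumer

# Each application subscribes with its own group id, so every app
# receives every balance update and applies it to its own storage.
consumer = KafkaConsumer(
    "balance-updates",
    bootstrap_servers="localhost:9092",
    group_id="app-2",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

local_storage = {}  # stand-in for the application's own data storage

for message in consumer:
    update = message.value  # e.g. {"account": "a1", "balance": 42}
    local_storage[update["account"]] = update["balance"]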

How can we use streaming in Spark from multiple sources? E.g., first take data from HDFS and then consume streaming data from Kafka

The problem arises because I already have a system and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought of dumping the data into HDFS.
New data also arrives in the database every day, so I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources, do analytics in near real time (a 1-2 minute delay is OK), and save the results, because future data needs previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can read MySQL over JDBC rather than needing HDFS exports.
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (it could take more than a few minutes).
However, if you use Debezium to get data from MySQL into Kafka, then you have the data centralized in one location, and you can ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases / query patterns.
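As a sketch of the JDBC tip above, assuming a hypothetical transactions table and standard MySQL connection details:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-jdbc-read").getOrCreate()

# Read the MySQL table directly into Spark over JDBC, so it can be joined
# with streaming data without a separate HDFS dump. The table name,
# credentials, and partitioning bounds below are assumptions.
transactions = (spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/shop")
    .option("dbtable", "transactions")
    .option("user", "spark")
    .option("password", "secret")
    .option("partitionColumn", "id")  # parallelize the read across executors
    .option("lowerBound", "1")
    .option("upperBound", "50000000")
    .option("numPartitions", "8")
    .load())

transactions.createOrReplaceTempView("transactions")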

Merge multiple data sources into a sink and keep it up to date

My business has many tables in different MySQL instances, and we now want to build a search platform.
We have to extract the tables, join them into one wide table, insert it into Elasticsearch, and keep it up to date in ES.
In addition, we have an application that converts the MySQL binlog into change messages and delivers them to Kafka.
Is there a suitable solution? Can Flink or Materialize help me?
We have an application that converts the MySQL binlog into change messages and delivers them to Kafka
That's a start.
You would then need to do the same with the tables feeding your wide table, and try to put those into a Kafka KTable, which you can then join using Kafka Streams or ksqlDB, as one example. Flink may work here as well.
After you have the joined stream+table written into a new Kafka topic, you can use the Kafka Connect Elasticsearch connector, or again try Flink, Logstash, etc., to ingest into the search indices.
Once everything is running, any new inserts to either table will be reflected in Elasticsearch.
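To make the join step concrete, here is a hand-rolled sketch of a stream-table join using the kafka-python client, materializing a changelog topic into a local dict (a poor man's KTable). The topic names and fields are assumptions; Kafka Streams, ksqlDB, or Flink give you the same semantics with state stores and fault tolerance built in.

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders", "customers",  # "customers" is the table changelog topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

customers = {}  # local materialization of the table topic

for message in consumer:
    event = message.value
    if message.topic == "customers":
        customers[event["customer_id"]] = event  # update the local "KTable"
    else:
        # Enrich the order with the latest customer state and write the
        # joined record to a new topic for the Elasticsearch sink to pick up.
        enriched = {**event, "customer": customers.get(event["customer_id"])}
        producer.send("orders_enriched", enriched)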

Filtering rows of a database

I would like to know if the Kafka platform is suitable for the following job.
I'm trying to ingest a full database with multiple tables. Once the data is ingested by Kafka, I would like to filter the rows of each table based on a condition.
I think that is an easy job using Kafka Streams, but what happens to messages that are rejected by the filter?
The condition could be met in the future (if it is based on a date, for example), so is there a chance that a rejected message will be evaluated again, eventually pass the filter, and be processed further?
Is it better to filter the rows of data before feeding them into Kafka?
Thank you.
You might want to consider using a database connector such as Debezium or the Confluent JDBC Source Connector, which are both based on Kafka Connect.
More on the Debezium connector for MySQL: http://debezium.io/docs/connectors/mysql
More on the Confluent JDBC connector: http://docs.confluent.io/current/connect/connect-jdbc/docs/source_connector.html
With connectors based on Kafka Connect, you can filter the rows of data before publishing to Kafka using the Single Message Transform (SMT) feature.
See this related discussion on row filtering with Kafka Connect: "Kafka connect (Single message transform) row filtering".
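On the rejected-message question: a filter only decides what gets forwarded; rejected records stay in the source topic until its retention expires, so re-reading the topic from an earlier offset would re-evaluate them. A minimal sketch of that behavior with the kafka-python client, where the topic names, field name, and date condition are assumptions:

import json
from datetime import date
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "rows",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # re-running from here re-evaluates old rows
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    row = message.value
    # Forward rows whose effective date has arrived; others are skipped,
    # but they remain in the "rows" topic and can be re-processed later.
    if date.fromisoformat(row["effective_date"]) <= date.today():
        producer.send("rows_filtered", row)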

How can I increase big data performance?

I am new to this concept and still learning. I have 10 TB of JSON files in AWS S3 and 4 instances (m3.xlarge) in AWS EC2 (1 master, 3 workers). I am currently using Spark with Python on Apache Zeppelin.
I am reading the files with the following command:
hcData=sqlContext.read.option("inferSchema","true").json(path)
In zeppelin interpreter settings:
master = yarn-client
spark.driver.memory = 10g
spark.executor.memory = 10g
spark.cores.max = 4
It takes approximately 1 minute to read 1 GB. What more can I do to read big data more efficiently?
Should I do more on coding?
Should I increase instances?
Should I use another notebook platform?
Thank you.
For performance issues, the best approach is to find out where the bottleneck is, or at least narrow down where the performance problem could be.
Since 1 minute to read 1 GB is pretty slow, I would try the following steps:
Explicitly specify the schema instead of using inferSchema (see the sketch after this list)
Use Spark 2.0 instead of 1.6
Check the connection between S3 and EC2, in case there is some misconfiguration
Use a different file format, such as Parquet, instead of JSON
Increase the executor memory and decrease the driver memory
Use Scala instead of Python, although in this case it is the least likely issue
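A sketch of the explicit-schema suggestion; the field names and types below are assumptions standing in for your real JSON schema:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Declaring the schema up front lets Spark skip the inference pass
# over the whole 10 TB dataset.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("payload", StringType(), True),
])

hcData = sqlContext.read.schema(schema).json(path)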
I gave a talk on this topic back in October: Spark and Object Stores.
Essentially: use Parquet/ORC, but tune the settings for efficient reads. Once it ships, grab Spark 2.0.x built against Hadoop 2.8 for lots of the speedup work we've done, especially for ORC and Parquet. We also added lots of metrics, though they are not yet all pulled back into the Spark UI.
Schema inference can be slow if it has to work through the entire dataset (CSV inference does; I don't know about JSON). I'd recommend doing it once, grabbing the schema details, and then explicitly declaring the schema next time around.
You can persist the data in Parquet format after the JSON read:
hcData = sqlContext.read.option("inferSchema", "true").json(path)
hcData.write.parquet("hcDataFile.parquet")
hcDataDF = spark.read.parquet("hcDataFile.parquet")
# Create a temporary view (Spark 2.0) or registerTempTable (Spark 1.6) and use SQL for further logic
hcDataDF.createOrReplaceTempView("T_hcDataDF")
# This is a manual way of doing RDD checkpointing (not supported for DataFrames); it reduces the RDD lineage, which improves performance.
For execution, use dynamic resource allocation in your spark-submit command.
# Make sure the following are enabled in your cluster; otherwise pass these parameters to spark-submit as --conf:
• spark.dynamicAllocation.enabled=true
• spark.dynamicAllocation.initialExecutors=5
• spark.dynamicAllocation.minExecutors=5
• spark.shuffle.service.enabled=true
• yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle
• yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService
# spark-submit command
./bin/spark-submit --class package.hcDataclass \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1G \
  --executor-memory 5G \
  hcData*.jar
# With dynamic resource allocation we don't need to specify the number of executors; the job will automatically get resources based on the cluster's available capacity.