Spring Batch for massive nightly / hourly Hive / MySQL data processing

I'm looking into replacing a bunch of Python ETL scripts that perform a nightly / hourly data summary and statistics gathering on a massive amount of data.
What I'd like to achieve is:
Robustness - a failing job / step should be automatically restarted. In some cases I'd like to execute a recovery step instead.
The framework must be able to recover from crashes. I guess some persistence would be needed here.
Monitoring - I need to be able to monitor the progress of jobs / steps, and preferably see history and statistics with regards to the performance.
Traceability - I must be able to understand the state of the executions.
Manual intervention - nice to have... being able to start / stop / pause a job from an API / UI / command line.
Simplicity - I prefer not to get angry looks from my colleagues when I introduce the replacement... Having a simple and easy to understand API is a requirement.
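To make the robustness requirement concrete, here is a toy sketch of restart-with-persisted-state, hand-rolled for illustration only (it is not Spring Batch, whose job repository handles exactly this bookkeeping for you; the step names and the JSON state file are made up):

```python
import json, os

STATE_FILE = "job_state.json"

def load_state():
    """Read the persisted step statuses from the previous (possibly crashed) run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def run_job(steps):
    """steps: list of (name, callable). On restart, completed steps are skipped."""
    state = load_state()
    for name, step in steps:
        if state.get(name) == "COMPLETED":
            continue                      # crash recovery: don't redo finished work
        step()
        state[name] = "COMPLETED"
        with open(STATE_FILE, "w") as f:  # persist status after every step
            json.dump(state, f)
```

A framework gives you this plus the monitoring, history, and manual start/stop/pause on top, which is where the hand-rolled approach stops scaling.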
The current scripts do the following:
Collect text logs from many machines, and push them into Hadoop DFS. We may use Flume for this step in the future (see http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/).
Perform Hive summary queries on the data, and insert (overwrite) to new Hive tables / partitions.
Extract the new summary data into files, and load (merge) it into MySQL tables. This data is needed later for online reports.
Perform additional joins on the newly added MySQL data (with other MySQL tables), and update the data.
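For illustration, the four steps above amount to a driver that fails fast on any step. Everything in this sketch is a hypothetical placeholder (paths, table names, queries), not the actual scripts:

```python
import subprocess

def build_steps(day):
    """Return the shell commands for one nightly run (day = 'YYYY-MM-DD').
    All paths and table names are made-up examples."""
    return [
        # 1. push collected logs into HDFS
        f"hadoop fs -put /var/logs/{day}/*.log /raw_logs/dt={day}/",
        # 2. summarize in Hive, overwriting the day's partition
        ("hive -e \"INSERT OVERWRITE TABLE summary PARTITION (dt='{0}') "
         "SELECT host, COUNT(*) FROM raw_logs WHERE dt='{0}' GROUP BY host\"").format(day),
        # 3. extract the summary to a local file for the MySQL load
        f"hive -e \"SELECT * FROM summary WHERE dt='{day}'\" > /tmp/summary_{day}.tsv",
        # 4. merge into MySQL via a staging table
        f"mysql reports -e \"LOAD DATA LOCAL INFILE '/tmp/summary_{day}.tsv' INTO TABLE summary_stage\"",
    ]

def run(day):
    for cmd in build_steps(day):
        subprocess.run(cmd, shell=True, check=True)  # fail fast, like a batch step
```

The point of moving this into a batch framework is that each command becomes a step with recorded status, instead of one opaque script.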
My idea is to replace the scripts with Spring Batch. I also looked into Scriptella, but I believe it is too 'simple' for this case.
Since I've seen some bad vibes about Spring Batch (mostly in old posts), I'm hoping to get some input here. I also haven't seen much about Spring Batch and Hive integration, which is troublesome.

If you want to stay within the Hadoop ecosystem, I'd highly recommend checking out Oozie to automate your workflow. We (Cloudera) provide a packaged version of Oozie that you can use to get started. See our recent blog post for more details.

Why not use JasperETL or Talend? Seems like the right tool for the job.

I've used Cascading quite a bit and found it to be quite impressive:
Cascading
It is an M/R abstraction layer, and runs on Hadoop.

Related

Data warehousing (ETL): Scheduled data migration from MySQL to PostgreSQL

I've been assigned the task of data warehousing for reporting and data analysis. Let me first explain what we are going to do.
Step 1. Replicate production server MySQL database.
Step 2. Scheduled ETL: Read replicated database (MySQL) and push data to PostgreSQL.
Now I need your help on Step 2.
Note: I want saveOrUpdate functionality: if the id already exists, update the row; otherwise insert it. Data will be picked up based on the modified date.
So, considering my requirements, is there any tool available for scheduled data pushes into PostgreSQL?
If no tool is available, which programming language should I use for the ETL? Any other pointers on how to achieve this would also be appreciated.
I asked the same question https://dba.stackexchange.com/questions/203460/data-warehousing-etl-scheduled-data-migration-from-mysql-to-postgresql on dba.stackexchange.com, but I guess it has a low user base, so I'm posting it here.
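As a sketch of the saveOrUpdate behaviour, assuming each row carries an id and a modified_date column, and assuming PostgreSQL 9.5+ for `INSERT ... ON CONFLICT` (older versions need a different upsert idiom), the two queries a scheduled job would run each interval could be generated like this:

```python
def upsert_sql(table, columns, key="id"):
    """Build a PostgreSQL 9.5+ upsert: insert the row, or update it
    if a row with the same id already exists (saveOrUpdate)."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
            f"ON CONFLICT ({key}) DO UPDATE SET {updates}")

def changed_rows_sql(table, watermark_column="modified_date"):
    """Select only rows touched since the last run, per the modified-date requirement."""
    return f"SELECT * FROM {table} WHERE {watermark_column} > %(last_run)s"
```

The scheduler (cron, or the ETL tool of choice) would run the changed-rows query against the MySQL replica, then execute the upsert per row on the PostgreSQL side, persisting the last-run watermark between invocations.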
On AWS you have DMS. I don't know whether you can use it with external services, but it works pretty well.

Replicating data from MySQL to HBase using Flume: how?

I have a large mySQL database with heavy load and would like to replicate the data in this database to Hbase in order to do analytical work on it.
Edit: I want the data to replicate relatively quickly, and without any schema changes (no timestamped rows, etc.).
I've read that this can be done using flume, with mySQL as a source, possibly the mySQL bin logs, and Hbase as a sink, but haven't found any detail (high or low level). What are the major tasks to make this work?
Similar questions were asked and answered earlier, but the answers didn't really explain how, or point to resources that would:
Flume to migrate data from MySQL to Hadoop
Continuous data migration from mysql to Hbase
IMHO, you are better off using Sqoop for this; it was developed for exactly this purpose. Flume was made for rather different tasks, like aggregating log data or data generated from sensors.
See this for more details.
So far there are three options worth considering:
Sqoop: after the initial bulk import, it supports two modes of incremental import: append and lastmodified. That said, it won't give you real-time or even near-real-time replication. It's not that Sqoop can't run that fast; it's that you don't want to plug a Sqoop pipe into your MySQL server and pull data every minute or two.
Trigger: this is a quick-and-dirty solution: add triggers to the source RDBMS and update your HBase accordingly. It gives you real-time updates, but you have to mess with the source DB by adding triggers. It might be OK as a temporary solution, but long term it just won't do.
Flume: this one requires the most development effort, but it doesn't need to touch the DB, nor does it add read traffic to the DB (it tails the transaction logs).
Personally I'd go for Flume: not only does it channel the data from the RDBMS to your HBase, it also lets you do something with the data while it streams through your Flume pipe (e.g. transformation, notification, alerting, etc.).
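The Sqoop option boils down to an invocation like the following sketch. The JDBC URL and table name are placeholders; the incremental flags themselves (`--incremental lastmodified`, `--check-column`, `--last-value`, `--merge-key`) are real Sqoop 1 options:

```python
def sqoop_incremental_cmd(table, check_column, last_value, merge_key="id"):
    """Build a Sqoop incremental import in 'lastmodified' mode.
    The connection string and table are made-up examples."""
    return [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/source_db",
        "--table", table,
        "--incremental", "lastmodified",
        "--check-column", check_column,   # a modified-timestamp column
        "--last-value", last_value,       # watermark from the previous run
        "--merge-key", merge_key,         # collapse updated rows on re-import
    ]
```

A scheduler would run this on each interval, recording the new watermark that Sqoop prints at the end of the run.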

Fluentd+Mongo vs. Logstash

Our team currently uses Zabbix for monitoring and alerting. In addition, we use Fluentd to gather logs into a central MongoDB, and it has been in production for a week. Recently we discussed another solution: Logstash. What are the differences between them? In my opinion, I'd like to use Zabbix as the data-gathering and alert-sending platform, with Fluentd playing the 'data-gathering' role in the whole infrastructure. I've looked at the Logstash website and found that Logstash is not only a log-gathering system, but a whole solution for gathering, presenting, and searching logs.
Could anybody give some advice or share some experience?
Logstash is pretty versatile (disclaimer: have only been playing with it for a few weeks).
We'd been looking at Graylog2 for a while (listening for syslog and providing a nice search UI), but its message-processing functionality is based on the Drools engine and is, at best, arcane.
I found it was much easier to have logstash read syslog files from our central server, massage the events and output to Graylog2. Gave us much more flexibility and should allow us to add application level events alongside the OS level syslog data.
It has a zabbix output, so you might find it's worth a look.
Logstash is a great fit with Zabbix.
I forked a repo on GitHub to take the Logstash statsd output and send it to Zabbix for trending/alerting. As another answer mentioned, Logstash also has a Zabbix output plugin, which is great for notifying/alerting on matching events.
Personally, I prefer the native Logstash->Elasticsearch backend to Logstash->Graylog2(->Elasticsearch).
It's easier to manage, especially if you have a large volume of log data. At present, Graylog2 also uses Elasticsearch, but it uses a single index for all data. If you periodically clean up old data, that means the equivalent of a lot of SQL "DELETE FROM table WHERE date < 'YYYY.MM.DD'" calls, whereas Logstash defaults to daily indexes (the equivalent of "DROP TABLE YYYY.MM.DD"), so clean-up is much nicer.
Daily indexes also give you cleaner searches that require less heap space, since you can restrict a search to a known date: each index is named for the day of data it contains.
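The daily-index clean-up described above can be sketched like this. The naming follows Logstash's default `logstash-YYYY.MM.dd` pattern; the retention window is a made-up example:

```python
from datetime import date, timedelta

def daily_index(day):
    """Logstash-style daily index name, e.g. logstash-2013.04.01."""
    return "logstash-" + day.strftime("%Y.%m.%d")

def indexes_to_drop(today, keep_days, existing):
    """The cheap clean-up: whole daily indexes older than the retention
    window get deleted, the 'DROP TABLE' equivalent, rather than a big
    DELETE ... WHERE scan over one giant shared index."""
    keep = {daily_index(today - timedelta(days=i)) for i in range(keep_days)}
    return sorted(ix for ix in existing if ix not in keep)
```

A nightly cron job would compute `indexes_to_drop` and issue one index-delete call per name, which is why the daily-index layout makes retention so much cheaper.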

What strategy/technology should I use for this kind of replication?

I am currently facing a problem I have not yet figured out a good solution for, so I hope to get some advice from you all.
My problem is as in the picture:
The Core Database is where all the clients connect to manage the live data, which is really big and busy all the time.
The Feature Database is not used as often, but it needs some portion of the live data (maybe 5%) from the Core Database, and the requests sent to this server take longer and consume a lot of resources.
What is my current solution:
I use database replication between the Core Database and the Feature Database, and it works fine. But
the problem is that I waste a lot of disk space storing unwanted data.
(Filtering while replicating the data does not work with my database schema.)
Using a queueing system would not keep the data live on time, as there are many requests to the Core Database.
Please suggest some idea if you have met this?
Thanks,
Pang
What you describe is a classic data integration task. You can use any data integration tool to extract data from your core database and load it into the feature database. You can schedule your data integration jobs anywhere from near-real-time to any time frame.
I used Talend in my mid-size (10GB) semi-scientific PostgreSQL database integration project. It worked beautifully.
You can also try SQL Server Integration Services (SSIS). This tool is very powerful as well. It works with all top-notch RDBMSs.
If all you're worrying about is disk space, I would stick with the solution you have right now. 100GB of disk space costs less than a dollar these days; for that money, you can't really afford to bring a new solution into the system.
Logically, there's also a case to be made for keeping the filtering in the same application: keeping the responsibility for knowing which records are relevant inside the app itself, rather than in some mysterious integration layer, reduces overall solution complexity. Only accept the additional complexity of a special integration layer if you really need it.

Replicating database changes

I want to "replicate" a database to an external service. For doing so I could just copy the entire database (SELECT * FROM TABLE).
If some changes are made (INSERT, UPDATE, DELETE), do I need to upload the entire database again, or is there a log file describing these operations?
Thanks!
It sounds like your "external service" is not just another database, so traditional replication might not work for you. More details on that service would help us tailor the answers. Depending on how quickly data must reach your external service and the performance demands of your application, the main options would be:
Triggers: add INSERT / UPDATE / DELETE triggers that update your external service's data when your data changes (this could be rough on your app's performance, but provides near-real-time data for the external service)
Log Processing: you can parse changes from the logs and use some level of ETL to make sure they'll run properly against your external service's data storage. I wouldn't recommend getting into this unless you're familiar with the log structure of your particular DBMS.
Incremental Diffs: you could run diffs on some interval (maybe 3x a day, for example) and have a cron job or scheduled task run a script that moves all the data in a big chunk. This prioritizes your app's performance over the external service.
If you choose triggers, you may be able to tweak an existing trigger-based replication solution to update your external service. I haven't used these so I have no idea how crazy that would be, just an idea. Some examples are Bucardo and Slony.
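The incremental-diffs option can be sketched as a watermark-based pass. The 'modified' column name and the timestamp values here are assumptions for illustration:

```python
def rows_changed_since(rows, last_sync):
    """Incremental-diff pass: keep only the rows modified since the last
    scheduled run, so each interval ships a small batch instead of the
    whole table. Assumes every row carries a 'modified' timestamp."""
    return [r for r in rows if r["modified"] > last_sync]

def next_watermark(changed, last_sync):
    """Advance the watermark to the newest row shipped this run."""
    return max((r["modified"] for r in changed), default=last_sync)
```

The cron job or scheduled task would persist the watermark between runs; note this approach misses DELETEs unless rows are soft-deleted with the same modified timestamp.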
There are many ways to replicate a PostgreSQL database. In the current version, 9.0, the PostgreSQL Global Development Group introduced two great new features, Hot Standby and Streaming Replication, taking PostgreSQL to a new level and providing a built-in replication solution.
On the wiki there is a complete review of the new PostgreSQL 9.0 features:
http://wiki.postgresql.org/wiki/PostgreSQL_9.0
There are other applications like Bucardo, Slony-I, Londiste (SkyTools), etc., which you can use too.
Now, what do you want to do about log processing? What exactly do you want? Regards.