How to use flume to read data from mysql? - mysql

How to use flume to read continously from mysql to load to hbase?
I am familiar with sqoop but I need to continuously do it from a mysql source.
Is it required to have custom source to do this?

Sqoop is good for bulk import from RDBMS to HDFS/Hive/HBase. If it's just one time import, it's very good, it does what it promises on paper. But the problem comes when you wanna have real-time incremental updates. Between two types of incremental updates Sqoop supports:
Append, this one allows you to re-run the sqoop job, and every new job starts where the last old job ends. eg. first sqoop job only imported row 0-100, then next job will start from 101 based on --last-value=100. But even if 0-100 got updated, Append mode won't cover them anymore.
last-modified, this one is even worse IMHO, it requires the source table has a timestamp field which indicates when the row gets last updated. Then based on the timestamp, it does the incremental updates import. If the source table doesn't have anything like that, this one is not useful.
I'd say, if you do have hands on your source db, you can go for the last-modified mode using Sqoop.

There are a number of ways to do this, but I would write a script that takes your data from MySQL and generates an Avro event for each.
You can then use the built-in Avro source to receive this data, sending it to the HDFS sink.

Related

SSIS package design, where 3rd party data is replacing existing data

I have created many SSIS packages in the past, though the need for this one is a bit different than the others which I have written.
Here's the quick description of the business need:
We have a small database on our end sourced from a 3rd party vendor, and this needs to be overwritten nightly.
The source of this data is a bunch of flat files (CSV) from the 3rd party vendor.
Current setup: we truncate the tables of this database, and we then insert the new data from the files, all via SSIS.
Problem: There are times when the files fail to come, and what happens is that we truncate the old data, though we don't have the fresh data set. This leaves us without a database where we would prefer to have yesterday's data over no data at all.
Desired Solution: I would like some sort of mechanism to see if the new data truly exists (these files) prior to truncating our current data.
What I have tried: I tried to capture the data from the files and add them to an ADO recordset and only proceeding if this part was successful. This doesn't seem to work for me, as I have all the data capture activities in one data flow and I don't see a way for me to reuse that data. It would seem wasteful of resources for me to do that and let the in-memory tables just sit there.
What have you done in a similar situation?
If files are not present update some flags like IsFile1Found to false and pass these flags to stored procedure which truncates on conditional basis.
If file is empty then Using powershell through Execute Process Task you can extract first two rows if there are two rows (header + data row) then it means data file is not empty. Then you can truncate the table and import the data.
other approach could be
you can load data into some staging table and from these staging table insert data to the destination table using SQL stored procedure and truncate these staging tables after data is moved to all the destination table. In this way before truncating destination table you can check if staging tables are empty or not.
I looked around and found that some others were struggling with the same issue, though none of them had a very elegant solution, nor do I.
What I ended up doing was to create a flat file connection to each file of interest and have a task count records and save to a variable. If a file isn't there, the package fails and you can stop execution at that point. There are some of these files whose actual count is interesting to me, though for the most part, I don't care. If you don't care what the counts are, you can keep recycling the same variable; this will reduce the creation of variables on your end (I needed 31). In order to preserve resources (read: reduce package execution time), I excluded all but one of the columns in each data source; it made a tremendous difference.

Sybase to MySQL automatic exportation

I have two databases: Sybase and MySQL. I need to export records to MySql when these are inserted in Sybase or export in some scheduled event.
I've tried with output statement but this can not be used in triggers or procedures.
Any suggestion to solve this problem?
(disclaimer, I've done similar things previously, but by no means would I consider the answer below the state of the art - just one possible approach
google around something like 'cross-database replication' or 'cross rdbms replication' to see who's done this before.
).
I would first of all see if you can't score an ETL tool do the job without too much work. There are free open source ones and even things like Microsoft SSIS might work on non-MS databases.
If not, I would split this into different steps.
Find an appropriate Sybase output command that exports a subset of rows from one or more tables. By subset I mean you need to be able to add a WHERE clause, not just do a full table dump.
Use an appropriate MySQL import script/command to load the data gotten out of step #1. You may need to cycle back and forth between the 2 till you have something that works manually.
Write a Sybase trigger to insert lookup keys into a to-export table. You want to store at least the tablename & source Sybase table's keys for each inserted row. Use column names like key1_char, key2_char, not the actual column names, that makes it easier to extend to other source tables as needed. keep trigger processing as light as possible. What about updates btw?
Write a scheduled batch on Sybase side to run step #1 for the rows flagged in #3.
Write a scheduled batch on Mysql to import ,via #2, the results of #4. Or kick it off from #4.
Another approach is to do the #3 flagging bit as needed, but use to drive one scheduled batch that SELECTs data from Sybase and INSERTs it into mysql directly.
You'll have to pick up the data from Sybase's SELECT and bind it manually to the INSERT of mysql. But you probably get finer control over whats going on and you don't have to juggle 2 batches. That's what I think a clever ETL would already be doing on your behalf. Any half clever scripting language like php, python or ruby ought to handle it easily. Especially important if you have things like surrogate/auto-generated keys.
Keep in mind that in both cases you'll have to either delete the to-export rows that you've successfully inserted or flag them as done.

Big data migration from Oracle to MySQL

I received over 100GB of data with 67million records from one of the retailers. My objective is to do some market-basket analysis and CLV. This data is a direct sql dump from one of the tables with 70 columns. I'm trying to find a way to extract information from this data as managing itself in a small laptop/desktop setup is becoming time consuming. I considered the following options
Parse the data and convert the same to CSV format. File size might come down to around 35-40GB as more than half of the information in each records is column names. However, I may still have to use a db as I cant use R or Excel with 66 million records.
Migrate the data to mysql db. Unfortunately I don't have the schema for the table and I'm trying to recreate the schema looking at the data. I may have to replace to_date() in the data dump to str_to_date() to match with MySQL format.
Are there any better way to handle this? All that I need to do is extract the data from the sql dump by running some queries. Hadoop etc. are options, but I dont have the infrastructure to setup a cluster. I'm considering mysql as I have storage space and some memory to spare.
Suppose I go in the MySQL path, how would I import the data? I'm considering one of the following
Use sed and replace to_date() with appropriate str_to_date() inline. Note that, I need to do this for a 100GB file. Then import the data using mysql CLI.
Write python/perl script that will read the file, convert the data and write to mysql directly.
What would be faster? Thank you for your help.
In my opinion writing a script will be faster, because you are going to skip the SED part.
I think that you need to setup a server on a separate PC, and run the script from your laptop.
Also use tail to faster get a part from the bottom of this large file, in order to test your script on that part before you run it on this 100GB file.
I decided to go with the MySQL path. I created the schema looking at the data (had to increase a few of the column size as there were unexpected variations in the data) and wrote a python script using MySQLdb module. Import completed in 4hr 40mins on my 2011 MacBook Pro with 8154 failures out of 67 million records. Those failures were mostly data issues. Both client and server are running on my MBP.
#kpopovbg, yes, writing script was faster. Thank you.

Validating the CSV data before inserting into SQL Server

I have a senario where i have to check the first column of the excel has a valid data and not in the other columns from the CSV file. If the data is not present in the first column my SSIS package should log an exeption.
Can any one help me in this senario please.
Thanks,
Sateesh.
In SSIS you can use a conditional Split task to do this and send the good data to where you want it and the bad data to an exception table.
Personally I always prefer to start with putting the data for any import into two tables, one with the raw unchanged data and one that will contain the cleaned up data before the import to the prod tables. This makes it easier to see what the cause is when you inevitably have to research why some bad data got into the database (if you are doing your job right, 90+% of the time it's bad data you were sent _ you can't know the contract expires on 4/12/2012 when you were sent 4/12/2011 to pick a not so random example). Also always make sure to save the input file to an archive loaction. Trust me, you will need one or more of those archived files some day.

Does any database support revision control on its command line interface?

Does any database support revision control on its command line interface?
So for instance, I'm at the mysql> command prompt. I'm going to add a column to a table. I type
ALTER TABLE X ADD COLUMN Y bigint;
and then the database prompts me:
"OK. Check these changes in using git?"
I respond yes, and it prints out the revision number for the schema.
(No need for a mysqldump to get the schema, removing the data, and checkin at the bash command line.)
At another time, I decide to check in all of my data as well. So I type something like
CHECKIN SNAPSHOT TABLE X
Seems like a common sense approach to me, but has anyone modified a database to support such behavior?
Thanks
Well, for one thing, checking in one table makes no sense; its a relational database for a reason; you can't just roll back one table to a previous state and expect that to be OK.
Instead, you can roll back the entire database to a previous state. The database world calls this "point in time recovery"; and you can do it in MySQL with a combination of a full backup and replication logs.
I have no idea why you'd want to store your backups in git.
What you're looking I believe are "migrations". These are usually done at the code level rather than at the database level. Rails supports migrations, Django does with South, I'm sure there's others out there as well.
These migrations are just code files (either in the programming language of choice, or sql scripts) that can be checked in.
The usual case is Schema migrations that don't migrate data but some (most?) allow you to write custom code to do so if you wish.
Generally speaking you wouldn't want to version control data, but there are exceptions when some of the data is part of the logic, but you generally want to separate the type of data that you consider part of the functionality and the parts that you don't.
If you want to store everything, that's just called backing up. Those generally don't do into version control just because of the size. Either it's in a binary format that doesn't diff well, or it's many times larger than it needs to be in a text format. If you need to diff the data, there are tools for that as well.