Big data migration from Oracle to MySQL - mysql

I received over 100GB of data with 67 million records from one of the retailers. My objective is to do some market-basket analysis and CLV. The data is a direct SQL dump of a single table with 70 columns. I'm trying to find a way to extract information from this data, as managing it on a small laptop/desktop setup is becoming time-consuming. I considered the following options:
Parse the data and convert it to CSV format. The file size might come down to around 35-40GB, as more than half of the information in each record is column names. However, I may still have to use a database, as I can't use R or Excel with 66 million records.
Migrate the data to a MySQL database. Unfortunately I don't have the schema for the table, so I'm trying to recreate it by looking at the data. I may have to replace to_date() in the data dump with str_to_date() to match the MySQL format.
Is there a better way to handle this? All I need to do is extract information from the SQL dump by running some queries. Hadoop etc. are options, but I don't have the infrastructure to set up a cluster. I'm considering MySQL as I have storage space and some memory to spare.
Suppose I go down the MySQL path, how would I import the data? I'm considering one of the following:
Use sed to replace to_date() with the appropriate str_to_date() inline. Note that I need to do this for a 100GB file. Then import the data using the mysql CLI.
Write a Python/Perl script that reads the file, converts the data and writes to MySQL directly.
What would be faster? Thank you for your help.

In my opinion writing a script will be faster, because you skip the separate sed pass over the file.
I think you need to set up a server on a separate PC and run the script from your laptop.
Also, use tail to quickly grab a part from the bottom of this large file, so that you can test your script on that part before you run it on the full 100GB file.
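If tail is not available (for example on Windows), the same sampling idea can be sketched in Python. This is only an illustration: tail_lines and its block_size parameter are hypothetical names, and the function reads backwards from the end of the file so the full 100GB is never loaded.

```python
import os

def tail_lines(path, n, block_size=1024 * 1024):
    """Return the last n lines of a file (as bytes), reading backwards in blocks."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Keep prepending blocks until the buffer holds more than n newlines,
        # which guarantees the last n lines are complete, or until the start
        # of the file is reached.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    return data.splitlines()[-n:]
```

The returned lines can be written to a small sample file and fed to the import script before the real run.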

I decided to go down the MySQL path. I created the schema by looking at the data (I had to increase a few of the column sizes, as there were unexpected variations in the data) and wrote a Python script using the MySQLdb module. The import completed in 4hr 40mins on my 2011 MacBook Pro, with 8154 failures out of 67 million records. Those failures were mostly data issues. Both client and server were running on my MBP.
@kpopovbg, yes, writing a script was faster. Thank you.
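The script itself is not shown in the thread. The following is only a sketch of how such a loader might batch its inserts and keep count of failed records. load_rows and batch_size are hypothetical names, and conn is assumed to be any DB-API connection on a transactional engine (for MySQLdb the placeholders in insert_sql would be %s).

```python
def load_rows(conn, insert_sql, rows, batch_size=1000):
    """Insert rows in batches; return (inserted_count, failed_rows)."""
    inserted = 0
    failed = []
    cur = conn.cursor()

    def flush(batch):
        nonlocal inserted
        try:
            cur.executemany(insert_sql, batch)
            conn.commit()
            inserted += len(batch)
        except Exception:
            # At least one row in the batch is bad: undo the batch and
            # retry row by row so that only the offending records are lost.
            conn.rollback()
            for row in batch:
                try:
                    cur.execute(insert_sql, row)
                    conn.commit()
                    inserted += 1
                except Exception as exc:
                    conn.rollback()
                    failed.append((row, str(exc)))

    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)
    return inserted, failed
```

Batching keeps the number of round trips low, while the row-by-row fallback means a handful of data issues does not abort the whole import.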

Related

Validation of migrated data for MySQL

I'm migrating a large (approx. 10GB) MySQL database (InnoDB engine).
I've figured out the migration part: export with mysqldump, import with mysql.
However, I'm trying to figure out the best way to validate that the migrated data is correct. I thought of the following approaches, but they don't completely work for me.
One approach would be to use CHECKSUM TABLE. However, I can't use it, since the target database will have data continuously written to it (from other sources) even during the migration.
Another approach would be to use a combination of MD5(), GROUP_CONCAT, and CONCAT. However, that won't work for me either, as some of the columns contain large JSON data.
So, what would be the best way to validate that the migrated data is correct?
Thanks.
How about this?
Do SELECT ... INTO OUTFILE from each old and new table, writing them into .csv files. Then run diff(1) between the files, eyeball the results, and convince yourself that the new tables' rows are an appropriate superset of the old tables'.
These flat files are modest in size compared to a whole database and diff is fast enough to be practical.
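Where eyeballing the diff output is impractical, the superset check could be automated. A minimal sketch, assuming both tables were exported as comma-separated files and that the new table's export fits in memory; missing_rows is a hypothetical helper name.

```python
import csv
from collections import Counter

def missing_rows(old_csv, new_csv):
    """Return the rows of old_csv that are absent (or occur fewer times) in new_csv."""
    with open(new_csv, newline="") as f:
        new_counts = Counter(tuple(row) for row in csv.reader(f))
    missing = []
    with open(old_csv, newline="") as f:
        for row in csv.reader(f):
            key = tuple(row)
            if new_counts[key] > 0:
                new_counts[key] -= 1   # consume one matching row
            else:
                missing.append(row)    # the old row has no counterpart
    return missing
```

An empty result means every old row is present in the new table; rows written to the target by other sources are simply ignored.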

How to import Oracle sql file into MySQL

I have a .sql file from Oracle which contains create table/index statements and a lot of insert statements (around 1M inserts).
I can manually modify the create table/index part (there is not too much of it), but the insert part uses some Oracle functions like to_date.
I know MySQL has a similar function, STR_TO_DATE, but its format parameter is written differently.
I can connect to MySQL, but the .sql file is the only thing I got from Oracle.
Is there any way I can import this Oracle .sql file into MySQL?
Thanks.
Although the above job can be done by manually editing the script, there are products available that can help. Refer to the link for more information on one such product.
P.S. I am not affiliated in any way with the product.
Since you mention an insert script, I assume you are basically inserting data. For this you can use any ETL tool, for example an open-source one like Pentaho Data Integrator. It is pretty simple to do: search YouTube for table-to-table transformation between different database connections to learn how. You need to be able to connect to both the MySQL and Oracle databases, otherwise this won't help. You still have to create all the table structures manually in the source database; the data you can then just load using ETL. There is no need to edit every single insert line, which gets very painful once there are more than 100 or so.
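For the to_date calls mentioned in the question, the manual edit can also be scripted. The sketch below rewrites to_date('value', 'mask') into STR_TO_DATE with a translated mask. It covers only a few common Oracle format elements, assumes both arguments are plain quoted literals, and would also rewrite a to_date(...) that happens to appear inside string data; convert_mask and convert_to_date are hypothetical names.

```python
import re

# Oracle format elements -> MySQL specifiers. Longer tokens come first so
# that "YYYY" is matched before "YY" and "HH24" before "HH".
ORACLE_TO_MYSQL = [
    ("YYYY", "%Y"), ("HH24", "%H"), ("MON", "%b"), ("MM", "%m"),
    ("DD", "%d"), ("MI", "%i"), ("SS", "%s"), ("YY", "%y"), ("HH", "%h"),
]
LOOKUP = dict(ORACLE_TO_MYSQL)
TOKEN_RE = re.compile("|".join(tok for tok, _ in ORACLE_TO_MYSQL), re.IGNORECASE)
TO_DATE_RE = re.compile(
    r"\bto_date\s*\(\s*('[^']*')\s*,\s*'([^']*)'\s*\)", re.IGNORECASE
)

def convert_mask(mask):
    """Translate an Oracle date mask such as 'YYYY-MM-DD' to MySQL syntax."""
    return TOKEN_RE.sub(lambda m: LOOKUP[m.group(0).upper()], mask)

def convert_to_date(sql):
    """Replace every to_date('value', 'mask') call in one SQL line."""
    def repl(m):
        return "STR_TO_DATE(%s, '%s')" % (m.group(1), convert_mask(m.group(2)))
    return TO_DATE_RE.sub(repl, sql)
```

Applied line by line to the .sql file, this leaves all other text untouched; masks using elements outside the table (for example MONTH or AM) would need extra entries.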

How to convert EXCEL to SQL (I have 143864 row and 100 column in excel) total 48,316 KB

I converted the Excel file to CSV first, then imported it through phpMyAdmin, but only 100 rows were imported. I changed the buffer size in config.inc, but the result did not change. Could you please help me?
My main goal is to compare two tables in MySQL Workbench. I already have one table in SQL; I need to convert the Excel data to SQL so that I can use "compare schemas" after creating an EER model of the existing database.
It's good that you described the purpose of this approach. This way I can tell you in advance that converting the Excel data to a MySQL table will not help.
The model features (sync, compare etc.) all work on metadata only. They do not consider any table content. So instead you should do a textual comparison, by converting the table you have on the server to CSV.
Comparing such large documents is a challenge, however. If you only have a few changes, then a diff tool (a visual one like Araxis Merge, or diff on the command line) may help. For larger changesets a small utility app (maybe self-written) might be necessary.
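Such a self-written utility could be as small as the sketch below. It assumes both CSV files have a header row and share a column with unique values to match rows on; compare_csv, load_by_key and key_column are hypothetical names.

```python
import csv

def load_by_key(path, key_column):
    """Read a CSV with a header row into a dict keyed by one column."""
    with open(path, newline="") as f:
        return {row[key_column]: row for row in csv.DictReader(f)}

def compare_csv(old_path, new_path, key_column):
    """Return (added_keys, removed_keys, changed) between two CSV files."""
    old = load_by_key(old_path, key_column)
    new = load_by_key(new_path, key_column)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for key in old.keys() & new.keys():
        # Record (old value, new value) for every column that differs.
        diffs = {
            col: (old[key].get(col), new[key].get(col))
            for col in old[key]
            if old[key].get(col) != new[key].get(col)
        }
        if diffs:
            changed[key] = diffs
    return added, removed, changed
```

Unlike a plain textual diff, this reports which columns changed for each key and does not depend on the order of the rows.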

How to go about updating a MySQL Table from a CSV file every [time interval]?

Firstly, I understand that attempting to do this from MySQL itself is not allowed:
http://dev.mysql.com/doc/refman/5.6/en/stored-program-restrictions.html
When I try to use LOAD DATA INFILE 'c:/data.csv' ... inside a stored procedure, I get the error "LOAD DATA IS NOT ALLOWED IN STORED PROCEDURES".
I am a beginner with moving data around MySQL and I realize this may not be a task it was designed to handle. Therefore, what approach should I use to grab data from a CSV file and append it to a table at a regular time interval? (I have researched a little bit about CRON, but that is for UNIX systems only and we are using a Windows based OS.)
You can run a cron job on Windows as well. I found a couple of links after searching; please look into them:
waytocode.com/2012/setup-cron-job-on-windows-server
http://stackoverflow.com/questions/24035090/run-cron-job-on-php-script-on-localhost-in-windows
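Whatever scheduler is used, the job it runs can be a small script outside MySQL. A minimal sketch in Python, assuming the CSV has no header row and only ever grows at the end; append_new_rows and state_path are hypothetical names, and conn is any DB-API connection (with MySQLdb the placeholders in insert_sql would be %s).

```python
import csv
import os

def append_new_rows(conn, csv_path, insert_sql, state_path):
    """Insert CSV rows not loaded by a previous run; return how many were added."""
    done = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = int(f.read().strip() or 0)   # rows loaded so far
    with open(csv_path, newline="") as f:
        rows = [tuple(r) for i, r in enumerate(csv.reader(f)) if i >= done]
    if rows:
        cur = conn.cursor()
        cur.executemany(insert_sql, rows)
        conn.commit()
        # Remember the new position only after a successful commit.
        with open(state_path, "w") as f:
            f.write(str(done + len(rows)))
    return len(rows)
```

On Windows, the built-in Task Scheduler can run such a script at a fixed interval, for example via schtasks /Create /SC MINUTE /MO 15 /TN "LoadCsv" /TR "python C:\scripts\load_csv.py" (the task name and script path here are placeholders).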

How to import large mysql dumps into hadoop?

I need to import Wikipedia dumps (MySQL tables; the unpacked files take about 50GB) into Hadoop (HBase). Currently I first load the dump into MySQL and then transfer the data from MySQL to Hadoop. But loading the data into MySQL takes a huge amount of time, about 4-7 days. Is it possible to load a MySQL dump directly into Hadoop (by means of some dump file parser or something similar)?
As far as I remember, MySQL dumps are almost entirely a set of insert statements. You can parse them in your mapper and process them as is. If you have only a few tables, hard-coding the parsing in Java should be trivial.
Use Sqoop, a tool that imports MySQL data into HDFS using MapReduce jobs.
It is handy.