Validation of migrated data for MySQL - mysql

I'm migrating a large(approx. 10GB) MySQL database(InnoDB engine).
I've figured out the migration part. Export -> mysqldump, Import -> mysql.
However, I'm trying to figure out the optimum way to validate if the migrated data is correct. I thought of the following approaches but they don't completely work for me.
One approach could have been using CHECKSUM TABLE. However, I can't use it since the target database would have data continuously written to it(from other sources) even during migration.
Another approach could have been using the combination of MD5(), GROUP_CONCAT, and CONCAT. However, that also won't work for me as some of the columns contain large JSON data.
So, what would be the best way to validate that the migrated data is correct?
Thanks.

How about this?
Do SELECT ... INTO OUTFILE from each old and new table, writing them into .csv files. Then run diff(1) between the files, eyeball the results, and convince yourself that the new tables' rows are an appropriate superset of the old tables'.
These flat files are modest in size compared to a whole database and diff is fast enough to be practical.

Related

MySQL: Automate Data Ingestion from regular txt/csv files to a Database

Intro
I've searched all around about this problem, but I didn't really found a source of knowledge about this, so I'm sorry if this problem seems basic to you, but for me is rather quite intriguing due the fact that I'm having hard time to guess what keywords to use on google in order to retrieve proper info.
Problem Description :
As a matter of fact, i have to issues that i don't know how to deal in a MySQL instance installed in a laptop in a windows environment:
I have a DB in MySQL with 50 tables, of with 15 or 20 tables are tables with original data. The other tables were tables that i generated from the original data tables, in order to properly create tables that would allow me to analyze data in PowerBI. The original data tables are fed by dumps from a ERP Database.
My issue is the following:
How would one automate the process of receiving cumulative txt/csv files (via pen-drive or any other transfer mechanism), store those files into a folder and then update the existing tables with the new information? Is there any reference of best practices to deal with such a scenario?
How can i maintain the good shape of my database with the successive data integration, I mean, how can I make my database scalable and responsive?
Can you point me some sources that would help me with this?
At the moment I imported data into tables, in 2 steps:
1st - I created the table structure with the Workbench import wizard help ( I had to do it this way because the tables have a lot of fields - dozens of them, literally, and those fields need to be in the database). I also inserted primary keys and indexes in those tables;
2nd - I Managed to load the data from the files into those tables, using LOAD DATA IN FILE command.
Some of the fields of the tables created with the import wizard, were created as data type text, with is not necessary in this scenario. I would like to revert those fields to data type NVARCHAR(255) or something, However there are a lot of field to alter the data type and in multiple tables at this point, and i was wondering if i can write a query to do the job of creating all the ALTER TABLES statements i need.
So my issue here is: is it safe to alter the data type in multiple fields in multiple columns (in this case i would like to change fields with datatype text to NAVARCHAR(255))? What is the best way to do this? Can you point me to some sources or best practices for this, please?
Thank you, in advance, for your help.
Cheers
You need a scripting language, not a UI. See mysql commandline tool, the shell of your OS, etc, etc.
DROP DATABASE and reCREATE it
LOAD DATA
Massage the data to get the columns cleaner than what the load data provided
Sic the BI tool on the data.
If you want to discuss Step 3, we need details about what transformations are needed between step 2 and step 4. That includes providing the format or schema for steps 2 and 4.

Big data migration from Oracle to MySQL

I received over 100GB of data with 67million records from one of the retailers. My objective is to do some market-basket analysis and CLV. This data is a direct sql dump from one of the tables with 70 columns. I'm trying to find a way to extract information from this data as managing itself in a small laptop/desktop setup is becoming time consuming. I considered the following options
Parse the data and convert the same to CSV format. File size might come down to around 35-40GB as more than half of the information in each records is column names. However, I may still have to use a db as I cant use R or Excel with 66 million records.
Migrate the data to mysql db. Unfortunately I don't have the schema for the table and I'm trying to recreate the schema looking at the data. I may have to replace to_date() in the data dump to str_to_date() to match with MySQL format.
Are there any better way to handle this? All that I need to do is extract the data from the sql dump by running some queries. Hadoop etc. are options, but I dont have the infrastructure to setup a cluster. I'm considering mysql as I have storage space and some memory to spare.
Suppose I go in the MySQL path, how would I import the data? I'm considering one of the following
Use sed and replace to_date() with appropriate str_to_date() inline. Note that, I need to do this for a 100GB file. Then import the data using mysql CLI.
Write python/perl script that will read the file, convert the data and write to mysql directly.
What would be faster? Thank you for your help.
In my opinion writing a script will be faster, because you are going to skip the SED part.
I think that you need to setup a server on a separate PC, and run the script from your laptop.
Also use tail to faster get a part from the bottom of this large file, in order to test your script on that part before you run it on this 100GB file.
I decided to go with the MySQL path. I created the schema looking at the data (had to increase a few of the column size as there were unexpected variations in the data) and wrote a python script using MySQLdb module. Import completed in 4hr 40mins on my 2011 MacBook Pro with 8154 failures out of 67 million records. Those failures were mostly data issues. Both client and server are running on my MBP.
#kpopovbg, yes, writing script was faster. Thank you.

How to move data from one SQLite to MySQL with different designs?

The problem is:
I've got a SQLite database which is constantly being updated though a proprietary application.
I'm building an application which uses MySQL and the database design is very different from the one of SQLite.
I then have to copy data from SQLite to MySQL but it should be done very carefully as not everything should be moved, tables and fields have different names and sometimes data from one table goes to two tables (or the opposite).
In short, SQLite should behave as a client to MySQL inserting what is new and updating the old in an automated way. It doesn't need to be updating in real time; every X hours would be enough.
A google search gave me this:
http://migratedb.sourceforge.net/
And asking a friend I got information about the Multisource plugin (Squirrel SQL) in this page:
http://squirrel-sql.sourceforge.net/index.php?page=plugins
I would like to know if there is a better way to solve the problem or if I will have to make a custom script myself.
Thank you!
I recommend a custom script for this:
If it's not a one-to-one conversion between the tables and fields, tools might not help there. In your question, you've said:
...and sometimes data from one table goes to two tables (or the opposite).
If you only want the differences, then you'll need to build the logic for that unless every record in the SQLite db has timestamps.
Are you going to be updating the MySQL db at all? If not, are you okay to completely delete the MySQL db and refresh it every X hours with all the data from SQLite?
Also, if you are comfortable with a scripting language (like php, python, perl, ruby, etc.), they have API's for both SQLite and MySQL; it would be easy enough to build your own script which you can control customise more easily based on program logic. Especially if you want to run "conversions" between the data from one to the other and not just simple mapping.
I hope i understand you correctly, that you will flush the data which are stored in a SQLite DB periodicly to a MySQL DB. Right?
So this is how i would do it.
Create a Cron, which starts the script every x minutes.
Export the Data from SQLite into an CSV-File.
Do an LOAD DATA INFILE an import the CSV Data to MySQL
Code example for LOAD DATA INFILE
LOAD DATA INFILE 'PATH_TO_EXPORTED_CSV' REPLACE INTO TABLE your_table FIELDS TERMINATED BY ';' ENCLOSED BY '\"' LINES TERMINATED BY '\\n' IGNORE 1 LINES ( #value_column1, #unimportend_value, #value_column2, #unimportend_value, #unimportend_value, #value_column3) SET diff_mysql_column1 = #value_column1, diff_mysql_column2 = #value_column2, diff_mysql_column3 = #value_column3);
This Code you can query to as much db tables you want. Also you can change the variables #value_column1.
Im in a hurry. so thats it for now. ask if something is unclear.
Greets Michael

Combine several mssql database to one mysql with php

We are handling a data aggregation project by having several microsoft sql server databases combining to one mysql database. all mssql database have the same schema.
The requirements are :
each mssql database can be imported to mysql independently
before being able to import each record to mysql we need to validates each records with a specific createrias via php.
each imported mssql database can be rollbacked. It means even it already imported to mysql, all the mssql database can be removed from the mysql.
we would still like to know where does each record imported to the mysql come from what mssql database.
All import process will be done with PHP .
we have difficulty in many aspects. we don't know what is the best approach to solve our problem.
your help will be highly appreciated.
ps: each mssql database has around 60 tables and each table can have a few hundred thousands .
Don't use PHP as a database administration utility. Any time you build a quick PHP script to transfer records directly from one database to another, you're going to cause yourself a world of hurt when that script becomes required for production operation.
You have a number of problems that you need solved:
You have multiple MSSQL databases with similar if not identical tables.
You have a single MySQL database that you want to merge the data into.
The imported data must be altered in a specific way before being merged.
You want to prevent all duplicate records in your import.
You want to know what database each record originally came from.
The solution?
Analyze the source MSSQL databases and create a merge strategy for them.
Create a database structure on the MySQL database that fits the merge strategy in #1, including all the new key constraints (like unique and foreign keys) required for the consolidation.
At this point you have two options left:
Dump the data from each of the source databases into raw data using your RDBMS administration utility of choice. Alter that data to fit your merge strategy and constraints. Document this, and then merge all of the data into your new database structure.
Use a tool like opendbcopy to map columns from one database to another and run a mass import.
Hope this helps.

MySQL to SQL Server transferring data

I need to convert data that already exists in a MySQL database, to a SQL Server database.
The caveat here is that the old database was poorly designed, but the new one is in a proper 3N form. Does any one have any tips on how to go about doing this? I have SSMS 2005.
Can I use this to connect to the MySQL DB and create a DTS? Or do I need to use SSIS?
Do I need to script out the MySQL DB and alter every statement to "insert" into the SQL Server DB?
Has anyone gone through this before? Please HELP!!!
See this link. The idea is to add your MySQL database as a linked server in SQL Server via the MySQL ODBC driver. Then you can perform any operations you like on the MySQL database via SSMS, including copying data into SQL Server.
Congrats on moving up in the RDBMS world!
SSIS is designed to do this kind of thing. The first step is to map out manually where each piece of data will go in the new structure. So your old table had four fields, in your new structure fileds1 and 2 go to table a and field three and four go to table b, but you also need to have the autogenerated id from table a. Make notes as to where data types have changed and you may need to make adjustments or where you have required fileds where the data was not required before etc.
What I usually do is create staging tables. Put the data in the denormalized form in one staging table and then move to normalized staging tables and do the clean up there and add the new ids as soon as you have them to the staging tables. One thing you will need to do if you are moving from a denormalized database to a normalized one is that you will need to eliminate the duplicates from the parent tables before inserting them into the actual production tables. You may also need to do dataclean up as there may be required fileds in the new structure that were not required in the old or data converstion issues becasue of moving to better datatypes (for instance if you stored dates in the old database in varchar fields but properly move to datetime in the new db, you may have some records which don't have valid dates.
ANother issue you need to think about is how you will convert from the old record ids to the new ones.
This is not a an easy task, but it is doable if you take your time and work methodically. Now is not the time to try shortcuts.
What you need is an ETL (extract, transform, load) tool.
http://en.wikipedia.org/wiki/Extract,_transform,_load#Tools
I don't really know how far an 'ETL' tool will get you depending on the original and new database designs. In my career I've had to do more than a few data migrations and we usually always had to design a special utility which would update a fresh database with records from the old database, and yes we coded it complete with all the update/insert statements that would transform data.
I don't know how many tables your database has, but if they are not too many then you could consider going the grunt root. That's one technique that's guaranteed to work after all.
If you go to your database in SSMS and right-click, under tasks should be an option for "Import Data". You can try to use that. It's basically just a wizard that creates an SSIS package for you, which it can then either run for you automatically or which you can save and then alter as needed.
The big issue is how you need to transform the data. This goes into a lot of specifics which you don't include (and which are probably too numerous for you to include here anyway).
I'm certain that SSIS can handle whatever transformations you need to do to change it from the old format to the new. An alternative though would be to just import the tables into MS SQL as-is into staging tables, then use SQL code to transform the data into the 3NF tables. It's all a matter of what your most comfortable with. If you go the second route, then the import process that I mentioned above in SSMS could be used. It will even create the destination tables for you. Just be sure that you give them unique names, maybe prefixing them with "STG_" or something.
Davud mentioned linked servers. That's definitely another way that you can go (and got my upvote). Personally, I prefer to copy the tables over into MS SQL first since linked servers can sometimes have weirdness, especially when it comes to data types not mapping between different providers. Having the tables all in MS SQL will also probably be a bit faster and saves time if you have to rerun or correct portions of the data. As I said though, the linked server method would probably be fine too.
I have done this going the other direction and SSIS works fine, although I might have needed to use a script task to deal with slight data type weirdness. SSIS does ETL.