Data swap in a large database - mysql

I have 2.8 billion records in a table. The table uses the INFOBRIGHT engine running on a MySQL installation. There are a few incorrect entries in the table that I would like to get corrected.
Table Test has 350-odd columns. I would like to swap data from column P1 to column P3 for a few records (not all). The approach I had planned for migrating the data is as follows:
1. Extract the data from table Test to a CSV file using the INTO OUTFILE functionality of MySQL.
2. Delete the unwanted records from the table.
3. Import the CSV data using LOAD DATA INFILE and use the SET clause to move the data from P1 to P3 (i.e. load the old P1 value into P3 and an empty string into P1), as sketched below.
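To make step 3 concrete, here is a minimal sketch of the intended LOAD DATA INFILE ... SET statement as the standard MySQL loader would run it. The file path, delimiters, and the simplification of the 350 columns down to three (id, P1, P3) are assumptions for illustration only:

LOAD DATA INFILE '/tmp/test_dump.csv'
INTO TABLE Test
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(id, @p1, @old_p3)   -- capture the CSV's P1 and P3 values into user variables
SET P3 = @p1,        -- the old P1 value lands in P3
    P1 = '';         -- and P1 is loaded as an empty string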
The approach seemed to make sense until I realized that INFOBRIGHT does not support the SET clause, as mentioned here.
Excerpt from the link:
The SET construct is supported by the MySQL loader found in the standard MySQL download but not by the Infobright loader included in ICE. I was able to actually execute a load using the SET statement; what’s interesting is that it will run but the SET gets ignored by Infobright.
Question
Is there an easier way to do this?
Of course, I can edit the CSV file. But for 2.8 billion records, I would like a sure-shot way of doing it. Any tested scripts would be appreciated.
I would not want to use the MySQL loader and load the data into a MyISAM table, because of the sheer size of the data involved. Are there any faster approaches?

Infobright doesn't allow SET while importing from a file, but you can choose the MySQL loader for loading the file. By default Infobright uses its own loader, but you can switch to the MySQL loader to load the file and then you can use SET.
You can select the MySQL loader using: set @bh_dataformat = 'mysql';
I don't know how much slower the MySQL loader will be than the IB loader. I have loaded a ~60 GB file with ~60 columns in 1.5 hours.
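Put together, the session would look roughly like this. This is a hedged sketch: it reuses the statement sketched in the question, and 'txt_variable' as the value to switch back to is an assumption that should be checked against your ICE version:

set @bh_dataformat = 'mysql';         -- this session uses the standard MySQL loader, which honors SET
-- ... run the LOAD DATA INFILE ... SET statement sketched in the question ...
set @bh_dataformat = 'txt_variable';  -- switch back to the Infobright loader for later CSV loads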

Related

Validation of migrated data for MySQL

I'm migrating a large (approx. 10 GB) MySQL database (InnoDB engine).
I've figured out the migration part. Export -> mysqldump, Import -> mysql.
However, I'm trying to figure out the optimum way to validate if the migrated data is correct. I thought of the following approaches but they don't completely work for me.
One approach could have been using CHECKSUM TABLE. However, I can't use it since the target database would have data continuously written to it (from other sources) even during migration.
Another approach could have been using the combination of MD5(), GROUP_CONCAT, and CONCAT. However, that also won't work for me as some of the columns contain large JSON data.
So, what would be the best way to validate that the migrated data is correct?
Thanks.
How about this?
Do SELECT ... INTO OUTFILE from each old and new table, writing them into .csv files. Then run diff(1) between the files, eyeball the results, and convince yourself that the new tables' rows are an appropriate superset of the old tables'.
These flat files are modest in size compared to a whole database and diff is fast enough to be practical.
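A hedged sketch of one such export; the table name, key column, and file path are placeholders, and the ORDER BY is my own addition so that both dumps come out in a stable order that diff can compare line by line:

SELECT *
FROM orders
ORDER BY id
INTO OUTFILE '/tmp/orders_old.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

Run the same statement against the new database (writing to a second file, e.g. /tmp/orders_new.csv) and diff the two files.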

import csv file with LOAD DATA LOCAL INFILE in symfony 1.4

I need to fill several tables with CSV files. I tried using a loop that does an insert for each row, but a file with 65,000 records takes me more than 20 minutes.
I want to use the MySQL command LOAD DATA LOCAL INFILE, but I received this message:
LOAD DATA LOCAL INFILE forbidden in C:\xampp\htdocs\myProject\apps\backend\modules\member\actions\actions.class.php on line 112
After a little research, I understand that one of the PDO security parameters (PDO::MYSQL_ATTR_LOCAL_INFILE) needs to be set to true.
In Symfony2 you change it in your app's config.yml, but I can't find the equivalent in symfony 1.4.
Let me try to understand the question (or questions?!).
If you need to optimize the INSERT queries, you should batch them into a single INSERT query (or a few large ones), but definitely not one query per row. Besides, single-row INSERTs in MySQL will always be slow for a large amount of data; the cost also depends on indexing, the storage engine, and the schema structure of the DB.
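For illustration, a batched insert looks like this (the table and column names are hypothetical):

INSERT INTO member (first_name, last_name, email) VALUES
  ('Ada',   'Lovelace', 'ada@example.com'),
  ('Alan',  'Turing',   'alan@example.com'),
  ('Grace', 'Hopper',   'grace@example.com');

Building the VALUES list in chunks of a few hundred or a few thousand rows usually gets most of the speedup without running into max_allowed_packet.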
About the second question, take a look here, maybe it will help.

Big data migration from Oracle to MySQL

I received over 100 GB of data with 67 million records from one of the retailers. My objective is to do some market-basket analysis and CLV. This data is a direct SQL dump from one of the tables, with 70 columns. I'm trying to find a way to extract information from it, as managing it in a small laptop/desktop setup is becoming time-consuming. I considered the following options:
1. Parse the data and convert it to CSV format. The file size might come down to around 35-40 GB, as more than half of the information in each record is column names. However, I may still have to use a DB, as I can't use R or Excel with 66 million records.
2. Migrate the data to a MySQL DB. Unfortunately I don't have the schema for the table, so I'm trying to recreate it by looking at the data. I may have to replace to_date() in the data dump with str_to_date() to match MySQL's format.
Is there any better way to handle this? All I need to do is extract the data from the SQL dump by running some queries. Hadoop etc. are options, but I don't have the infrastructure to set up a cluster. I'm considering MySQL as I have storage space and some memory to spare.
Supposing I go down the MySQL path, how would I import the data? I'm considering one of the following:
1. Use sed to replace to_date() with the appropriate str_to_date() inline (see the mapping sketched after this list). Note that I need to do this for a 100 GB file. Then import the data using the mysql CLI.
2. Write a python/perl script that will read the file, convert the data, and write to MySQL directly.
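For reference, the rewrite in option 1 boils down to a date-format mapping along these lines; the sample value and the Oracle format mask are assumptions, since the actual mask depends on what the dump contains:

-- Oracle form in the dump (assumed): TO_DATE('2014-07-15 10:30:00', 'YYYY-MM-DD HH24:MI:SS')
SELECT STR_TO_DATE('2014-07-15 10:30:00', '%Y-%m-%d %H:%i:%s') AS converted;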
What would be faster? Thank you for your help.
In my opinion writing a script will be faster, because you are going to skip the sed part.
I think you should set up the server on a separate PC and run the script from your laptop.
Also, use tail to quickly grab a chunk from the end of this large file, so that you can test your script on that chunk before running it against the whole 100 GB file.
I decided to go with the MySQL path. I created the schema by looking at the data (I had to increase a few of the column sizes as there were unexpected variations in the data) and wrote a Python script using the MySQLdb module. The import completed in 4 hr 40 min on my 2011 MacBook Pro, with 8154 failures out of 67 million records. Those failures were mostly data issues. Both client and server are running on my MBP.
@kpopovbg, yes, writing a script was faster. Thank you.

java and mysql load data infile misunderstanding

Thanks for viewing this. I need a little bit of help with this project that I am working on with MySQL.
For part of the project I need to load a few things into a MySQL database which I have up and running.
The info that I need, for each column in the table Documentation, is stored in text files on my hard drive.
For example, one column in the documentation table is "ports" so I have a ports.txt file on my computer with a bunch of port numbers and so on.
I tried to run this MySQL statement through phpMyAdmin:
LOAD DATA INFILE 'C:\\ports.txt' INTO TABLE `Documentation` (`ports`);
It ran successfully, so I went on to the other load I needed, which was
LOAD DATA INFILE 'C:\\vlan.txt' INTO TABLE `Documentation` (`vlans`)
This also completed successfully, but it added all the rows to the vlan column AFTER the last entry to the port column.
Why did this happen? Is there anything I can do to fix this? Thanks
Why did this happen?
LOAD DATA inserts new rows into the specified table; it doesn't update existing rows.
Is there anything I can do to fix this?
It's important to understand that MySQL doesn't guarantee that tables will be kept in any particular order. So, after your first LOAD, the order in which the data were inserted may be lost & forgotten - therefore, one would typically relate such data prior to importing it (e.g. as columns of the same record within a single CSV file).
You could LOAD your data into temporary tables that each have an AUTO_INCREMENT column and hope that such auto-incremented identifiers remain aligned between the two tables (MySQL makes absolutely no guarantee of this, but in your case you should find that each record is numbered sequentially from 1); once there, you could perform a query along the following lines:
INSERT INTO Documentation (ports, vlans) SELECT port, vlan FROM t_Ports JOIN t_Vlan USING (id);
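For completeness, a hedged sketch of those two helper tables and their loads; the table and column names (t_Ports.port, t_Vlan.vlan) simply follow the query above, and the VARCHAR sizes are assumptions:

CREATE TABLE t_Ports (id INT AUTO_INCREMENT PRIMARY KEY, port VARCHAR(64));
CREATE TABLE t_Vlan  (id INT AUTO_INCREMENT PRIMARY KEY, vlan VARCHAR(64));

-- each file fills only its data column; id numbers the rows 1, 2, 3, ... in load order
LOAD DATA INFILE 'C:\\ports.txt' INTO TABLE t_Ports (port);
LOAD DATA INFILE 'C:\\vlan.txt'  INTO TABLE t_Vlan  (vlan);

-- then run the INSERT ... SELECT above, and clean up:
DROP TABLE t_Ports, t_Vlan;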

How to Update the mysql database when large csv file gets modified

Initially, I created a database called "sample" and populated it from a massive CSV file.
Whenever I have small changes in the .csv file (some data is added/deleted/modified), I have to apply them to the database too. Re-importing the entire (large) .csv file every time is not efficient.
Is there an efficient way to push only the modified data from the .csv file to the database?
There's no simple way of doing this.
One plausible way would be to store the old version of the CSV somewhere, run a diff-program between the old and new version, and then use the resulting output to determine what has been updated, changed, or removed, and update the database accordingly.
This is however a bit unreliable, slow, and would take some effort to implement. If you can it would probably be better to adapt the source of the CSV file to update the database directly.
Since you also want to delete entries that no longer exist in the csv file, you will have to load the complete csv file every time (and truncate the table first) in order to get a 1:1 copy.
For more convenient synchronization you will probably have to use some scripting language (PHP, Python, etc.).
Sorry, that's all I know...
In my experience it's almost impossible to treat a data file that changes regularly as your "master data set": unless you can somehow generate a diff file that shows where the master data changed, you will always be forced to run through the entire csv file, query the database for the corresponding record, and then either do nothing (if identical), insert (if new), or update (if modified). In many cases it will even be faster to just drop the table and reload the entire thing, but that can lead to serious operational problems.
Therefore, if it's at all possible for you, I'd consider the database as the master data and generate the csv file from there.
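If you do end up re-scanning the whole CSV against the table as described above, the insert/update/delete bookkeeping can be pushed into MySQL via a staging table. A rough sketch, assuming a hypothetical live table sample_data with a unique id column and a CSV at /tmp/sample.csv:

CREATE TABLE sample_stage LIKE sample_data;

LOAD DATA INFILE '/tmp/sample.csv'
INTO TABLE sample_stage
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;                        -- skip the header row, if there is one

REPLACE INTO sample_data               -- inserts new rows, overwrites changed ones
SELECT * FROM sample_stage;            -- (relies on the unique id key)

DELETE d                               -- remove rows that no longer appear in the CSV
FROM sample_data AS d
LEFT JOIN sample_stage AS s ON d.id = s.id
WHERE s.id IS NULL;

DROP TABLE sample_stage;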