400MB of csv data into mysql in under 2 hours

I need to import 400MB of data into a mysql table from 10 different .txt files loosely formatted in a csv manner. I have 4 hours tops to do it.
Is this AT ALL possible?
Is it better to chop the files into smaller files?
Is it better to upload them all simultaneously or to upload them sequentially?
Thanks
EDIT:
I have uploaded some of the files to the server and am trying to use "load data infile" from phpMyAdmin to achieve the import, using the following syntax & parameters:
load data infile 'http://example.com/datafile.csv' replace into table database.table fields terminated by ',' lines terminated by ';' (`id`,`status`);
It throws an Access denied error; I was wondering if I could achieve this through a PHP file instead, as mentioned here.
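For reference (going by the MySQL docs, not something I've confirmed on this server yet): the server-side LOAD DATA INFILE expects a file path on the database server plus the FILE privilege, not a URL, while the LOCAL variant reads the file from the client and sidesteps that privilege. A minimal sketch of the LOCAL form, with a placeholder path and the same column list as above:

LOAD DATA LOCAL INFILE '/path/to/datafile.csv'
REPLACE INTO TABLE mydb.mytable
FIELDS TERMINATED BY ','
LINES TERMINATED BY ';'
(`id`, `status`);

(LOCAL also has to be enabled on both client and server, e.g. local_infile=1 on the server side.)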
Thanks again
FINAL EDIT
(or My Rookie Mistakes)
In the end, it could be done. Easily.
Paranoia was winning the war in my mind as I wrote this, because of a time limit I thought impossible. I wasn't reading the right things; I was paying attention to nothing but the pressure.
The first mistake, an easy one to make and solve, was keeping the indexes while trying to import that first batch of data. It was pointed out in the comments, and I read it in many other places, but I dismissed it as unimportant; it wouldn't change a thing. Well, it definitely does.
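For what it's worth, a minimal sketch of the index part (table and index names are placeholders, not my real schema):

ALTER TABLE mydb.mytable DISABLE KEYS;   -- MyISAM: defer building non-unique indexes
-- ... run LOAD DATA INFILE here ...
ALTER TABLE mydb.mytable ENABLE KEYS;

-- Where DISABLE KEYS doesn't apply, dropping secondary indexes and
-- recreating them after the import has a similar effect:
ALTER TABLE mydb.mytable DROP INDEX idx_status;
-- ... import ...
ALTER TABLE mydb.mytable ADD INDEX idx_status (`status`);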
Still, the BIG mistake was using LOAD DATA INFILE on a table running on the InnoDB engine. I eventually ran across a post somewhere (whose link I've since lost) claiming that the command doesn't perform well on InnoDB tables, producing errors such as lock wait timeouts and the like.
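And for anyone who has to stay on InnoDB, the bulk-loading tips in the MySQL manual amount to relaxing checks for the session (same placeholder names as above; remember to turn everything back on afterwards):

SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

LOAD DATA INFILE '/path/to/datafile.csv'
REPLACE INTO TABLE mydb.mytable
FIELDS TERMINATED BY ',' LINES TERMINATED BY ';'
(`id`, `status`);

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;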
I'm guessing this should probably be declared a duplicate of some InnoDB vs MyISAM question (if only somewhat related), or someone could perhaps provide a more elaborate answer explaining what I only mention and know superficially, and I'd gladly select it as the correct answer (B-B-BONUS points for adding the relative size of an index compared to the table data, or something of the sort).
Thanks to all those who were involved.
PS: I've replaced HeidiSQL with SQLyog since I read the recommendation in the comments and a later answer; it's pretty decent and a bit faster/lighter. If I can get past the SQLServer-y interface I might keep it as my default db manager.

Related

Compare 2 large sql files and find differences to recover data

I have 2 large SQL files of around 8GB each. But in the latest backup I find that one file has 300MB of data missing.
I just want to compare which data is missing, so that I can check whether it was just temporary data or important data that has vanished.
When comparing both files via diff on Ubuntu 14.04 I always get a memory allocation error. I have also tried the solutions that let diff use more than the available memory, and all that, but still no help.
I want to gather all data which exists in sql1 but missing in sql2 to a new file sql3.
Please help!
EDIT: I recently moved from a plain MySQL server to Percona XtraDB Cluster, and a lot of tables were converted from MyISAM to InnoDB in the process. So, could that be the reason for a 300MB decrease in the mysqldump SQL files? I seriously doubt it, because SQL is SQL, but is the SQL dumped from InnoDB smaller in any case? Expert advice on this would help.
SQL dumps comparison is quite hard to do when dealing with large amounts of data. I would try the following:
Import each SQL file data into its own database
Use one of the methods indicated here to compare database content (I assume the schema is the same). E.g. Toad for MySql
This way of comparing should be faster, as data manipulation is much quicker once the data is stored in a database, and it also has the advantage that the missing data can easily be used. E.g.
SELECT *
FROM db1.sometable
WHERE NOT EXISTS (SELECT 1
                  FROM db2.sometable
                  WHERE db1.sometable.pkcol = db2.sometable.pk2)
will return exactly the missing rows in a convenient way.
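If the goal is to collect those missing rows somewhere (the sql3 from the question), a sketch along the same lines (the table and key names are just the placeholders used above):

CREATE DATABASE IF NOT EXISTS db3;
CREATE TABLE db3.sometable LIKE db1.sometable;

INSERT INTO db3.sometable
SELECT t1.*
FROM db1.sometable AS t1
LEFT JOIN db2.sometable AS t2
       ON t1.pkcol = t2.pk2
WHERE t2.pk2 IS NULL;

db3 can then be dumped with mysqldump to produce the sql3 file.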
If you export the dump you can use tools like Beyond Compare, Semantic Merge, Winmerge, Code Compare or other diff tools.
Note that some tools (e.g. Beyond Compare) have a 4096-character limit per row, which becomes a problem in the comparison (it drove me mad). It's possible to change that in Tools->FileFormat->[choose your format, maybe it is EverythingElse]->Conversion->64000 characters Per Line (this is the maximum).
You can also try changing the file format to SQL (it might not help much, though, and it will slow down your comparison).

Fastest way to copy a large MySQL table?

What's the best way to copy a large MySQL table in terms of speed and memory use?
Option 1. Using PHP, select X rows from old table and insert them into the new table. Proceed to next iteration of select/insert until all entries are copied over.
Option 2. Use MySQL INSERT INTO ... SELECT without row limits.
Option 3. Use MySQL INSERT INTO ... SELECT with a limited number of rows copied over per run.
EDIT: I am not going to use mysqldump. The purpose of my question is to find the best way to write a database conversion program. Some tables have changed, some have not. I need to automate the entire copy over / conversion procedure without worrying about manually dumping any tables. So it would be helpful if you could answer which of the above options is best.
There is a program that was written specifically for this task called mysqldump.
mysqldump is a great tool in terms of simplicity and careful handling of all types of data, but it is not as fast as load data infile
If you're copying on the same database, I like this version of Option 2:
a) CREATE TABLE foo_new LIKE foo;
b) INSERT INTO foo_new SELECT * FROM foo;
I've got lots of tables with hundreds of millions of rows (like 1/2B) AND InnoDB AND several keys AND constraints. They take many many hours to read from a MySQL dump, but only an hour or so by load data infile. It is correct that copying the raw files with the DB offline is even faster. It is also correct that non-ASCII characters, binary data, and NULLs need to be handled carefully in CSV (or tab-delimited files), but fortunately, I've pretty much got numbers and text :-). I might take the time to see how long the above steps a) and b) take, but I think they are slower than the load data infile... which is probably because of transactions.
Of the three options listed above, I would select the second if you have a unique constraint on at least one column, so that duplicate rows are not created if the script has to be run multiple times to finish its task in the event of server timeouts.
Otherwise your third option would be the way to go, while manually taking into account any server timeouts to determine your insert select limits.
Use a stored procedure
Option two should be fastest, but it's going to be a mighty long transaction. You should look into making a stored procedure do the copy. That way you could offload some of the data parsing/handling from your application to the MySQL engine.
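A rough sketch of such a procedure, copying in fixed-size batches (the foo/foo_new tables, the auto-increment id column and the batch size are all placeholders):

DELIMITER $$
CREATE PROCEDURE copy_in_chunks()
BEGIN
  DECLARE last_id BIGINT DEFAULT 0;
  DECLARE max_id  BIGINT;
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM foo;

  WHILE last_id < max_id DO
    -- copy one batch of rows; with autocommit on, each batch is its own transaction
    INSERT INTO foo_new
    SELECT * FROM foo
    WHERE id > last_id AND id <= last_id + 10000;

    SET last_id = last_id + 10000;
  END WHILE;
END$$
DELIMITER ;

CALL copy_in_chunks();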
MySQL's load data query is faster than almost anything else, however it requires exporting each table to a CSV file.
Pay particular attention to escape characters and representing NULL values/binary data/etc in the CSV to avoid data loss.
If possible, the fastest way will be to take the database offline and simply copy data files on disk.
Of course, this has some requirements:
you can stop the database while copying.
you are using a storage engine that stores each table in individual files; MyISAM does this.
you have privileged access to the database server (root login or similar)
Ah, I see you have edited your post, then I think this DBA-from-hell approach is not an option... but still, it's fast!
The best way I've found so far is to create the files as dump files (.txt) by using SELECT ... INTO OUTFILE to write out a text file, then using LOAD DATA INFILE in MySQL to load the same data into the database.
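A minimal sketch of that round trip (old_table/new_table and the path are placeholders, and the MySQL server has to be allowed to write to that path; the escaping shown writes NULLs as \N and backslash-escapes special characters so awkward data survives):

SELECT *
INTO OUTFILE '/tmp/old_table.txt'
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM old_table;

LOAD DATA INFILE '/tmp/old_table.txt'
INTO TABLE new_table
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n';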

Exporting Large MySql Table

I have a table in MySql that I manage using PhpMyAdmin. Currently it's sitting at around 960,000 rows.
I have a boss who likes to look at the data in Excel, which means weekly, I have to export the data into Excel.
I am looking for a more efficient way to do this, since I can't actually export the entire table at once because it times out. So I have been stuck 'chunking' the table into smaller queries and exporting it like that.
I have tried connecting Excel (and Access) directly to my database, but same problem; it times out. Is there any way to extend the connection limit?
Is your boss Rain Man? Does he just spot 'information' in raw 'data'?
Or does he build functions in excel rather than ask for what he really needs?
Sit with him for an hour and see what he's actually doing with the data. What questions are being asked? What patterns are being detected (manually)? Write a real tool to find that information and then plug it into your monitoring/alerting system.
Or, get a new boss. Seriously. You can tell him I said so.
Honestly for this size of data, I would suggest doing a mysqldump then importing the table back into another copy of MySQL installed somewhere else, maybe on a virtual machine dedicated to this task. From there, you can set timeouts and such as high as you want and not worry about resource limitations blowing up your production database. Using nice on Unix-based OSes or process priorities on Windows-based system, you should be able to do this without too much impact to the production system.
Alternatively, you can set up a replica of your production database and pull the data from there. Having a so-called "reporting database" that replicates various tables or even entire databases from your production system is actually a fairly common practice in large environments to ensure that you don't accidentally kill your production database pulling numbers for someone. As an added advantage, you don't have to wait for a mysqldump backup to complete before you start pulling data for your boss; you can do it right away.
The rows are not too many. Run a query to export your data into a csv file:
SELECT column,column2 INTO OUTFILE '/path/to/file/result.txt'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM yourtable WHERE columnx = 'condition';
One option would be to use SELECT ... INTO OUTFILE or mysqldump to export your table to CSV, which Excel can then directly open.

How to update database of ~25,000 music files?

Update:
I wrote a working script that finishes this job in a reasonable length of time, and seems to be quite reliable. It's coded entirely in PHP and is built around the array_diff() idea suggested by saccharine (so, thanks saccharine!).
You can access the source code here: http://pastebin.com/ddeiiEET
I have a MySQL database that is an index of mp3 files in a certain directory, together with their attributes (ie. title/artist/album).
New files are often being added to the music directory. At the moment it contains about 25,000 MP3 files, but I need to create a cron job that goes through it each day or so, adding any files that it doesn't find in the database.
The problem is that I don't know what is the best / least taxing way of doing this. I'm assuming a MySQL query would have to be run for each file on each cron run (to check if it's already indexed), so the script would unavoidably take a little while to run (which is okay; it's an automated process). However, because of this, my usual language of choice (PHP) would probably not suffice, as it is not designed to run long-running scripts like this (or is it...?).
It would obviously be nice, but I'm not fussed about deleting index entries for deleted files (if files actually get deleted, it's always manual cleaning up, and I don't mind just going into the database by hand to fix the index).
By the way, it would be recursive; the files are mostly situated in an Artist/Album/Title.mp3 structure, but they aren't religiously ordered like this, and the script would certainly have to be able to fetch ID3 tags for new files. In fact, ideally, I would like the script to fetch ID3 tags for each file on every run, and either add a new row to the database or update the existing one if it has changed.
Anyway, I'm starting from the ground up with this, so the most basic advice first I guess (such as which programming language to use - I'm willing to learn a new one if necessary). Thanks a lot!
First, a dumb question: would it not be possible to simply order the files by date added and only iterate through the files added in the last day? I'm not very familiar with working with files, but it seems like it should be possible.
If all you want to do is improve the speed of your current code, I would recommend that you check that your data is properly indexed. It makes queries a lot faster if you search through a table's index. If you're searching through columns that aren't the key, you might want to change your setup. You should also avoid using "SELECT *" and instead use "SELECT COUNT" as mysql will then be returning ints instead of objects.
You can also do everything in a few MySQL queries, but that will increase the complexity of your PHP code. Call the array with information about all the files $files. Select the data from the db where the files in the db match a file in $files. Something like this.
"SELECT id FROM MUSIC WHERE id IN ($files)"
Read the returned array and label it $db_files. Then find all files in $files array that don't appear in $db_files array using array_diff(). Label the missing files $missing_files. Then insert the files in $missing_files into the db.
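If you'd rather let MySQL do the diff instead of array_diff(), a sketch of the same idea (the music table, its path column and the temporary table are all assumed names):

-- one row per file found on disk, filled in by the scanning script
CREATE TEMPORARY TABLE scanned_files (path VARCHAR(500) PRIMARY KEY);
-- INSERT INTO scanned_files (path) VALUES ('Artist/Album/Title.mp3'), ...;

-- files on disk that the index does not know about yet
SELECT s.path
FROM scanned_files AS s
LEFT JOIN music AS m ON m.path = s.path
WHERE m.path IS NULL;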
What kind of engine are you using? If you're using MyISAM, the whole table will be locked while you update it. But still, 25k rows is not that much, so it should be updated in a few minutes at most. If it is InnoDB, just update it, since it uses row-level locking and you should still be able to use your table while updating it.
By the way, if you're not using any fulltext search on that table, I believe you should convert it to InnoDB, as you can then use foreign keys, and that would help you a lot when joining tables. Also, it scales better AFAIK.

mysql optimization script file

I'm looking at having someone do some optimization on a database. If I gave them a similar version of the db with different data, could they create a script file to run all the optimizations on my database (ie create indexes, etc) without them ever seeing or touching the actual database? I'm looking at MySQL but would be open to other db's if necessary. Thanks for any suggestions.
EDIT:
What if it were an identical copy with transformed data? Along with a couple sample queries that approximated what the db was used for (ie OLAP vs OLTP)? Would a script be able to contain everything or would they need hands on access to the actual db?
EDIT 2:
Could I create a copy of the db, transform the data to make it unrecognizable, create a backup file of the db, give it to vendor and them give me a script file to run on my db?
Why are you concerned that they should not access the database? You will get better optimization if they have the actual data, as they can consider table sizes, which queries run the slowest, whether to denormalise where necessary, putting small tables completely in memory, and so on.
If it is an issue of confidentiality you can always anonymise the data by replacing names.
If it's just adding indices, then yes. However, there are a number of things to consider when "optimizing". Which are the slowest queries in your database? How large are certain tables? How can certain things be changed/migrated to make those certain queries run faster? It could be harder to see this with sparse sample data. You might also include a query log so that this person could see how you're using the tables/what you're trying to get out of them, and how long those operations take.
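To give an idea of what such a script could contain, here is a hypothetical example (all table, column and index names are invented for illustration):

-- add indexes that the slow queries were missing
ALTER TABLE orders
  ADD INDEX idx_orders_customer_date (customer_id, order_date);
ALTER TABLE order_items
  ADD INDEX idx_items_order (order_id);

-- refresh index statistics so the optimizer picks up the changes
ANALYZE TABLE orders, order_items;

As long as the copy they work on has the same schema, a script like this can be handed back and run against the real database without the consultant ever touching it; what a script cannot capture is the query-log analysis that told them which indexes were worth adding in the first place.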