I have a working project, but I want to tune it a bit.
I use a batch: C:\xampp\mysql\bin\mysql --force -u user -ppass database1 < C:\some\import.sql
The file import.sql contains INSERT statements. Each file is about 30MB in size, and it keeps growing every day. Here is a (very basic) data example:
ID | DATE   | TEST
1  | 1-1-14 | Y
2  | 1-2-14 | Y
For each date, a few entries get added.
Yesterday: 49,999 lines in the file
Today: 50,002 lines in the file
So I only really need 3 lines of that file! And my batch "errors" on the other 49,999 lines, complaining that there is a duplicate entry.
Is there any way to speed this up?
It's not possible to give advice on speeding up SQL without knowing what the SQL says. There are all kinds of complex factors to take into account. You could consider editing your question to show more details.
The fastest way to bulk-load data into a MySQL server is to use the LOAD DATA INFILE command. That command reads data from a flat file like a CSV or tab-separated value file directly into a table. It doesn't work with files of SQL commands.
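For the sample data above, it would look roughly like this, assuming the data were exported as a comma-separated flat file instead of INSERT statements (the file path, table name, column names and line terminator here are all assumptions, so adjust them to your schema):

LOAD DATA INFILE 'C:/some/import.csv'
IGNORE                                -- skip rows that collide with an existing unique key
INTO TABLE mytable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n'
(ID, `DATE`, TEST);

The IGNORE keyword makes MySQL skip rows that would violate a primary or unique key instead of reporting thousands of duplicate errors, which is exactly the symptom described above.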
I have a 1.7GB txt file (about 1.5 million rows) that is apparently formatted in some way into columns and rows, though I don't know the delimiter. I will need to be able to import this data into MySQL and MS SQL databases to run queries on it.
I can't even open it in notepad to see a sample of the data.
For future reference, how does one handle and manipulate very large data files? What file format is best? To my knowledge Excel and CSV do not support unlimited numbers of rows.
You can use bcp in as shown below:
bcp yourtable in C:\Data\yourfile.txt -c -t, -S localhost -T
Since you know the column names from MySQL, you can create a table with that structure beforehand in SQL Server.
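For example (the column names and types below are placeholders; mirror whatever the MySQL columns actually look like):

REM create the target table first (placeholder columns), then bulk-load into it
sqlcmd -S localhost -E -Q "CREATE TABLE yourtable (id INT, created DATE, test CHAR(1))"
bcp yourtable in C:\Data\yourfile.txt -c -t, -S localhost -T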
I know how to import a text file into MySQL database by using the command
LOAD DATA LOCAL INFILE '/home/admin/Desktop/data.txt' INTO TABLE data
The above command will write the records of the file "data.txt" into the MySQL database table. My question is that I want to erase the records from the .txt file once they are stored in the database.
For example: if there are 10 records and at the current point in time 4 of them have been written into the database table, I require that these 4 records get erased from data.txt at the same time. (In a way, the text file acts as a queue.) How can I accomplish this? Can Java code be written for this, or should a scripting language be used?
Automating this is not too difficult, but it is also not trivial. You'll need something (a program, a script, ...) that can (see the sketch after this list):
Read the records from the original file,
Check if they were inserted and, if they were not, copy them into another file,
Rename or delete the original file, and rename the new file to replace the original one.
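A rough shell sketch of those three steps, assuming the path and table name from your question and placeholder MySQL credentials (this is an illustration, not production code):

#!/bin/sh
DATA=/home/admin/Desktop/data.txt
SNAP=/home/admin/Desktop/data.processing.txt

# 1. snapshot the current queue so new lines can keep arriving while we load
cp "$DATA" "$SNAP"

# 2. load the snapshot; if the load fails, stop and leave data.txt untouched
#    (requires local_infile to be enabled on the server as well)
mysql --local-infile=1 -u youruser -pyourpass yourdb \
  -e "LOAD DATA LOCAL INFILE '$SNAP' INTO TABLE data" || exit 1

# 3. keep only the lines that arrived after the snapshot, then swap the files
grep -F -x -v -f "$SNAP" "$DATA" > "$DATA.new"
mv "$DATA.new" "$DATA"
rm -f "$SNAP"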
There might be better ways of achieving what you want to do, but, that's not something I can comment on without knowing your goal.
Currently I have a situation that requires importing a giant SQL script into MySQL. The script mainly consists of INSERT statements, but there are a lot of records and the file size is around 80GB.
The machine has 8 CPUs and 20GB of memory. I have done something like:
mysql -h [*host_ip_address*] -u *username* -px xxxxxxx -D *databaseName* < giant.sql
But the whole process takes several days, which is quite long. Are there any other options for importing the SQL file into the database?
Thanks so much.
I suggest you try LOAD DATA INFILE. It is extremely fast. I've not used it for loading to a remote server, but there is the mysqlimport utility. See a comparison of the different approaches: https://dev.mysql.com/doc/refman/5.5/en/insert-speed.html.
Also, you need to convert your SQL script to a format suitable for the LOAD DATA INFILE clause.
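As a rough example of that conversion (this assumes the dump was written with one single-row INSERT per line, e.g. mysqldump --skip-extended-insert, and uses a made-up table name mytable):

# pull the value lists out of the INSERT statements into a comma-separated file
sed -n "s/^INSERT INTO \`mytable\` VALUES (\(.*\));$/\1/p" giant.sql > mytable.csv

The resulting mytable.csv can then be fed to LOAD DATA LOCAL INFILE (or mysqlimport) with FIELDS TERMINATED BY ',' and the quoting options matched to how the dump wrote its string values.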
You can break the SQL file into several files (one per table) using a shell script, and then prepare a shell script to import those files one by one. This will make the insert faster than doing it in one go (a sketch follows below).
The reason is that the inserted records occupy memory for a single process and are not released; you can see that after the import has been running for 5 hours, query execution speed becomes slow.
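A sketch of that approach, assuming the dump has the usual mysqldump "-- Table structure for table" comment headers and using placeholder connection details:

# split the dump into one file per table at each table-structure header
awk '/^-- Table structure for table/ { out = sprintf("part_%03d.sql", ++n) }
     out != "" { print > out }' giant.sql

# then import the parts one by one
for f in part_*.sql; do
    echo "importing $f"
    mysql -h host_ip_address -u username -pyourpass -D databaseName < "$f"
done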
Thanks for all your help, guys.
I have taken some of your advice and done some comparisons, and now it is time to post the results. The target is a single 15GB SQL script.
Overall, I tried:
importing data as a single SQL script with indexes (takes days; I finally killed it. DO NOT TRY THIS YOURSELF, you will be pissed off);
importing data as a single SQL script without indexes (same as above);
importing data as split SQL scripts with indexes (taking the single SQL script as an example, I split the big file into small chunks of around 41MB each; each chunk takes around 2m19.586s, total around …);
importing data as split SQL scripts without indexes (each chunk takes 2m9.326s).
(Unfortunately I did not try the LOAD DATA method for this dataset.)
Conclusion:
If you do not want to use the LOAD DATA method when you have to import a giant SQL script into MySQL, it is better to:
Divide it into small scripts;
Remove the indexes.
You can add the indexes back after importing (see the sketch below). Cheers
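For the index part, the idea is roughly this (the table, column and index names here are made up):

# drop the secondary index before the bulk import...
mysql -u username -pyourpass databaseName -e "ALTER TABLE mytable DROP INDEX idx_mycolumn"

# ...run the split import scripts...

# ...and re-create the index once all the data is in
mysql -u username -pyourpass databaseName -e "ALTER TABLE mytable ADD INDEX idx_mycolumn (mycolumn)"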
Thanks #btilly #Hitesh Mundra
Put the following commands at the head of the giant.sql file:
SET AUTOCOMMIT = 0;
SET FOREIGN_KEY_CHECKS=0;
and the following at the end:
SET FOREIGN_KEY_CHECKS = 1;
COMMIT;
SET AUTOCOMMIT = 1;
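If editing an 80GB file by hand is awkward, the same wrapping can be done without touching the file, something along these lines (connection details are placeholders):

# prepend and append the SET statements on the fly instead of editing giant.sql
( echo "SET AUTOCOMMIT = 0; SET FOREIGN_KEY_CHECKS = 0;"
  cat giant.sql
  echo "SET FOREIGN_KEY_CHECKS = 1; COMMIT; SET AUTOCOMMIT = 1;"
) | mysql -h host_ip_address -u username -p -D databaseName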
I am using MySQL.
I have a MySQL dump file (large_data.sql). I can create a database and load data from this dump file into the created database; no problem with this.
Now, I feel the data in the dump file is too large (for example, it contains 300,000 rows/objects in one table, and other tables also contain a large amount of data).
So, I decided to make another dump (based on the large dump) which contains a smaller amount of data (for example, 30 rows/objects per table).
With only that big dump file, what is the correct and efficient way to cut down the data in that dump and create a new dump file that contains a small amount of data?
------------------------- More -----------------------------------
(Using a text tool to open the large dump is not a good option: since the dump is very large, it takes a long time to open it in a text tool.)
If you want to work only on the textual dump files, you could use some text tools (like awk or sed, or perhaps a Perl, Python, or OCaml script) to handle them.
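For example, a rough awk one-liner along those lines (it assumes the dump was taken with --skip-extended-insert, so each row is its own INSERT line; adjust the 30 as needed):

# keep the schema and only the first 30 INSERT lines per table
awk '/^INSERT INTO/ { if (++count[$3] > 30) next } { print }' large_data.sql > small_data.sql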
But maybe your big database was already loaded from the big dump file, and you want to work with MySQL incremental backups?
I recommend this free file splitter: http://www.filesplitter.org/.
The only problem: it can cut a query into two parts. You need to edit the files manually afterwards, but it works like a charm.
Example:
My file is:
BlaBloBluBlw
BlaBloBluBlw
BlaBloBluBlw
The result will be:
File 1:
BlaBloBluBlw
BlaBloBl
File 2:
uBlw
BlaBloBluBlw
So you need to edit the split points, but it works like a charm and is very quick. I used it today on a table of 9.5 million rows.
BUT!! The best argument: the time you spend doing this is small compared to the time you would spend trying to import something big or waiting for it. It is quick and efficient even though you need to edit the files manually, since you need to rebuild the last and first query.
I have a .csv file with 10 billion rows. I want to check that each row is unique. Is there an easy way to do this? I was thinking that perhaps importing into MySQL would allow me to check uniqueness quickly. How can I upload this huge file to MySQL? I have already tried row-by-row INSERT statements and also the 'LOAD DATA INFILE' command, but both failed.
Thanks
I wouldn't use a database for this purpose, unless it needed to end up in the database eventually. Assuming you have the same formatting for each row (so that you don't have "8.230" and "8.23", or extra spaces at the start/end of lines with equal values), use a few text utilities included with most POSIX environments (Linux, Mac OS X), or available for Windows via GnuWin32 coreutils.
Here is the sequence of steps to do from your system shell. First, sort the file (this step is required):
sort ten.csv > ten_sorted.csv
Then find unique rows from sorted data:
uniq ten_sorted.csv > ten_uniq.csv
Now you can check to see how many rows there are in the final file:
wc ten_uniq.csv
Or you can just use pipes for combining the three steps with one command line:
sort ten.csv | uniq | wc
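If your sort supports the -u flag (GNU and BSD sort both do), the sort and uniq steps can also be folded together:

sort -u ten.csv | wc -l

which prints the number of distinct rows; compare it with wc -l ten.csv to see whether any duplicates exist.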
Does the data have a unique identifier? Make this column the primary key of your MySQL table, and when you go to import the data, MySQL should throw an error if you have duplicates.
As for how to go about doing it: just read the file in row by row and do an insert for each row.
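A minimal sketch of that setup (the table and column names are invented; the point is the PRIMARY KEY on the identifier column):

CREATE TABLE mydata (
  id BIGINT NOT NULL PRIMARY KEY,   -- the unique identifier column
  payload VARCHAR(255)
);
-- any row that repeats an existing id will now fail with a
-- "Duplicate entry" error instead of being silently loaded twice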
If you are importing from Excel or other such programs, see here for how to cleanse the CSV file before importing it into MySQL. Regarding unique rows, as long as your table schema is right, MySQL should be able to take care of it.
EDIT:
Whether the source is Excel or not, LOAD DATA LOCAL INFILE appears to be the way to go.
10 billion rows, and LOAD DATA LOCAL gives you an error? Are you sure there is no problem with the CSV file?
You have to split your dump into separate small, bite-sized chunks. Use BigDump.
http://www.ozerov.de/bigdump.php
If you do have 10 billion rows then you will struggle working with this data.
You would need to look at partitioning your database (ref here: about mysql partitioning)
However, even with that large number you would be requiring some serious hardware to cut through the work involved there.
Also, what would you do if a row was found to be non-unique? Would you want to continue importing the data? If you import the data, would you import the identical row or flag it as a duplicate? Would you stop processing?
This is the kind of job Linux is "made for".
First you have to split the file into many smaller files:
split -l 100 filename
After this you have a few options with the two commands sort and uniq. I timed 8 different options with a file of 1 million IP addresses from an ad-exchange log file, and found an almost 20x difference between using LC_ALL=C and not. For example:
LC_ALL=C sort IP_1m_rows.txt > temp_file
LC_ALL=C uniq temp_file > IP_unique_rows.txt
real 0m1.283s
user 0m1.121s
sys 0m0.088s
Whereas the same without LC_ALL=C:
sort IP_1m_rows.txt > temp_file
uniq temp_file > IP_unique_rows.txt
real 0m24.596s
user 0m24.065s
sys 0m0.201s
Piping the commands together while still using LC_ALL=C was more than twice as slow as the fastest:
LC_ALL=C sort IP_1m_rows.txt | uniq > IP_unique_rows.txt
real 0m3.532s
user 0m3.677s
sys 0m0.106s
Databases are not useful for one-off jobs like this, and flat files will get you surprisingly far, even with more challenging / long-term objectives.