Importing 10 billion rows into MySQL

I have a .csv file with 10 billion rows. I want to check that each row is unique. Is there an easy way to do this? I was thinking perhaps importing to mysql would allow me to find out uniqueness quickly. How can I upload this huge file to mysql? I have already tried row-by-row insert statements and also the 'LOAD DATA INFILE' command but both failed.
Thanks

I wouldn't use a database for this purpose, unless it needed to end up in the database eventually. Assuming every row has the same formatting (so that you don't have "8.230" and "8.23", or extra spaces at the start/end of otherwise equal lines), use a few of the text utilities included with most POSIX environments (Linux, Mac OS X), or available for Windows via the GnuWin32 coreutils.
Here is the sequence of steps to do from your system shell. First, sort the file (this step is required):
sort ten.csv > ten_sorted.csv
Then find unique rows from sorted data:
uniq ten_sorted.csv > ten_uniq.csv
Now you can check to see how many rows there are in the final file:
wc ten_uniq.csv
Or you can just use pipes for combining the three steps with one command line:
sort ten.csv | uniq | wc
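If the goal is simply to know whether any duplicates exist, you can also compare the total line count with the unique line count directly. A minimal sketch, assuming GNU coreutils and the ten.csv file name from above:
# any difference between the two counts means there are duplicate rows
total=$(wc -l < ten.csv)
unique=$(sort -u ten.csv | wc -l)
echo "total=$total unique=$unique duplicates=$((total - unique))"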

Does the data have a unique identifier? Have this column as primary key in your mysql table and when you go to import the data, mysql should throw an error if you have duplicates.
As for how to go about doing it: just read the file in row by row and do an insert for each row.
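A minimal sketch of that idea, assuming the first CSV column is the unique identifier (the table and column names here are made up, and LOCAL requires the local_infile option to be enabled). Using IGNORE makes MySQL skip rows with a duplicate key instead of aborting, so comparing the loaded row count with the file's line count shows how many duplicates there were:
CREATE TABLE import_check (
    id BIGINT NOT NULL PRIMARY KEY,
    payload TEXT
);
LOAD DATA LOCAL INFILE 'ten.csv'
    IGNORE INTO TABLE import_check
    FIELDS TERMINATED BY ','
    (id, payload);
-- if this count is lower than the file's line count, there were duplicate ids
SELECT COUNT(*) FROM import_check;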

If you are importing from Excel or another such program, see here for how to cleanse the csv file before importing it into MySQL. Regarding the unique rows: as long as your table schema is right, MySQL should be able to take care of it.
EDIT:
Whether the source is Excel or not, LOAD DATA LOCAL INFILE appears to be the way to go.
10 billion rows, and LOAD DATA LOCAL gives you an error? Are you sure there is no problem with the csv file?

You have to split your import into separate, bite-sized chunks. Use Big Dump.
http://www.ozerov.de/bigdump.php

If you do have 10 billion rows then you will struggle working with this data.
You would need to look at partitioning your database (see the MySQL documentation on partitioning).
However, even with that large number you would be requiring some serious hardware to cut through the work involved there.
Also, what would you do if a row was found to be non-unique? Would you want to continue importing the data? If you imported it, would you import the identical row or flag it as a duplicate? Would you stop processing altogether?
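For reference, partitioning in MySQL looks roughly like the sketch below; the table definition is entirely hypothetical and the partition count would need tuning to the data and hardware:
CREATE TABLE big_import (
    id BIGINT NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id)
)
PARTITION BY HASH(id)
PARTITIONS 32;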

This is the kind of job Linux is "made for".
First you have to split the file into many smaller files:
split -l 100 filename
After this you have a few options with the two commands sort/uniq. I timed 8 different approaches on a file of 1 million IP addresses from an ad-exchange log file and found an almost 20x difference between using LC_ALL=C and not. For example:
LC_ALL=C sort IP_1m_rows.txt > temp_file
LC_ALL=C uniq temp_file > IP_unique_rows.txt
real 0m1.283s
user 0m1.121s
sys 0m0.088s
Whereas the same without LC_ALL=C:
sort IP_1m_rows.txt > temp_file
uniq temp_file > IP_unique_rows.txt
real 0m24.596s
user 0m24.065s
sys 0m0.201s
Piping the commands together, still with LC_ALL=C, was roughly 2-3x slower than the fastest option:
LC_ALL=C sort IP_1m_rows.txt | uniq > IP_unique_rows.txt
real 0m3.532s
user 0m3.677s
sys 0m0.106s
Databases are not useful for one-off jobs like this, and flat files will get you surprisingly far even with more challenging or longer-term objectives.
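Putting the pieces together, here is a rough sketch of a split/sort/merge pass for a file too big to sort comfortably in one go. It assumes GNU coreutils and plenty of free disk space; note that GNU sort already spills to temporary files on its own, so the explicit split mainly helps if you want to sort the chunks in parallel or on separate machines:
export LC_ALL=C
split -l 50000000 ten.csv chunk_          # roughly 50M lines per chunk, adjust to taste
for f in chunk_*; do sort "$f" > "$f.sorted"; done
sort -m chunk_*.sorted > ten_sorted.csv   # merge the pre-sorted chunks
uniq -d ten_sorted.csv | head             # print a few duplicated rows, if any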

Related

Importing and exporting TSVs with MySQL

I'm using a database with MySQL 5.7, and sometimes, data needs to be updated using a mixture of scripts and manual editing. Because people working with the database are usually not familiar with SQL, I'd like to export the data as a TSV, which then could be manipulated (for example with Python's pandas module) and then be imported back. I assume the standard way would be to directly connect to the database, but using TSVs has some upsides in this situation, I think. I've been reading the MySQL docs and some stackoverflow questions to find the best way to do this. I've found a couple of solutions, however, they all are somewhat inconvenient. I will list them below and explain my problems with them.
My question is: did I miss something, for example some helpful SQL commands or CLI options to help with this? Or are the solutions I found already the best when importing/exporting TSVs?
My example database looks like this:
Database: Export_test
Table: Sample
Field       Type       Null  Key
id          int(11)    NO    PRI
text_data   text       NO
optional    int(11)    YES
time        timestamp  NO
Example data:
INSERT INTO `Sample` VALUES (1,'first line\\\nsecond line',NULL,'2022-02-16 20:17:38');
The data contains an escaped newline, which caused a lot of problems for me when exporting.
Table: Reference
Field        Type     Null  Key
id           int(11)  NO    PRI
foreign_key  int(11)  NO    MUL
Example data:
INSERT INTO `Reference` VALUES (1,1);
foreign_key is referencing a Sample.id.
Note about encoding: As a caveat for people trying to do the same thing: if you want to export/import data, make sure that character sets and collations are set up correctly for connections. This caused me some headache, because although the data itself is utf8mb4, the client, server and connection character sets were latin1, which caused some loss of data in some instances.
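One way to guard against that, as a sketch: force the client character set explicitly (the server, database and table still have to be utf8mb4 as well). The --default-character-set option works the same way for import commands:
mysql --default-character-set=utf8mb4 Export_test -e "SELECT * FROM Sample;" > out.tsv
# inside a session or script, the equivalent statement is: SET NAMES utf8mb4;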
Export
So, for exporting, I found basically three solutions, and they all behave somewhat differently:
A: SELECT stdout redirection
mysql Export_test -e "SELECT * FROM Sample;" > out.tsv
Output:
id text_data optional time
1 first line\\\nsecond line NULL 2022-02-16 21:26:13
Pros:
headers are added, which makes it easy to use with external programs
formatting works as intended
Cons:
NULL is used for null values; when importing, \N is required instead; as far as I know, this can't be configured for exports
Workaround: replace NULL values when editing the data
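As an illustration of that workaround, a sketch with awk: it rewrites whole tab-separated fields, so it won't touch NULL appearing inside a longer string, but it would also convert a legitimate field whose entire value is the literal text NULL:
awk 'BEGIN { FS = OFS = "\t" } { for (i = 1; i <= NF; i++) if ($i == "NULL") $i = "\\N" } 1' out.tsv > out_fixed.tsv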
B: SELECT INTO OUTFILE
mysql Export_test -e "SELECT * FROM Sample INTO OUTFILE '/tmp/out.tsv';"
Output:
1 first line\\\
second line \N 2022-02-16 21:26:13
Pros:
\N is used for null data
Cons:
escaped linebreaks are not handled correctly
headers are missing
file writing permission issues
Workaround: fix linebreaks manually; add headers by hand or supply them in the script; use /tmp/ as output directory
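For the missing headers specifically, one trick (a sketch using the example schema above) is to prepend a constant row via UNION; MySQL accepts INTO OUTFILE at the end of the last SELECT of a union. Strictly speaking, without an ORDER BY the header row is not guaranteed to come out first, although in practice it does:
SELECT 'id', 'text_data', 'optional', 'time'
UNION ALL
SELECT id, text_data, optional, time FROM Sample
INTO OUTFILE '/tmp/out_with_header.tsv';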
C: mysqldump with --tab (performs SELECT INTO OUTFILE behind the scenes)
mysqldump --tab='/tmp/' --skip-tz-utc Export_test Sample
Output, pros and cons: same as export variant B
Something that should be noted: the output is only the same as B, if --skip-tz-utc is used; otherwise, timestamps will be converted to UTC, and will be off after importing the data.
Import
Something I didn't realize at first is that it's impossible to merely update existing data directly with LOAD INTO or mysqlimport, although that's something many GUI tools appear to be doing and other people have attempted. For me as a beginner, this wasn't immediately clear from the MySQL docs. A workaround appears to be creating an empty table, importing the data there and then updating the actual table of interest via a join. I also thought one could update individual columns with this, which again is not possible. If there are other ways to achieve this, I would really like to know.
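For what it's worth, a sketch of that staging-table workaround using the example Sample table (the staging table name is made up, and the TSV is assumed to have a header line):
CREATE TEMPORARY TABLE Sample_staging LIKE Sample;
LOAD DATA INFILE '/tmp/Sample.tsv' INTO TABLE Sample_staging IGNORE 1 LINES;
UPDATE Sample s
JOIN Sample_staging st ON st.id = s.id
SET s.text_data = st.text_data,
    s.optional  = st.optional,
    s.`time`    = st.`time`;
DROP TEMPORARY TABLE Sample_staging;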
As far as I could tell, there are two options, which do pretty much the same thing:
LOAD INTO:
mysql Export_test -e "SET FOREIGN_KEY_CHECKS = 0; LOAD DATA INFILE '/tmp/Sample.tsv' REPLACE INTO TABLE Sample IGNORE 1 LINES; SET FOREIGN_KEY_CHECKS = 1;"
mysqlimport (performs LOAD INTO behind the scenes):
mysqlimport --replace Export_test /tmp/Sample.tsv
Notice: if there are foreign key constraints like in this example, SET FOREIGN_KEY_CHECKS = 0; needs to be performed (as far as I can tell, mysqlimport can't be directly used in these cases). Also, IGNORE 1 LINES or --ignore-lines can be used to skip the first line if the input TSV contains a header. For mysqlimport, the name of the input file without extension must be the name of the table. Again, file reading permissions can be an issue, and /tmp/ is used to avoid that.
Are there ways to make this process more convenient? Like, are there some options I can use to avoid the manual workarounds, or are there ways to use TSV importing to UPDATE entries without creating a temporary table?
What I ended up doing was using SELECT ... INTO OUTFILE for exporting, adding a header manually and fixing the malformed lines by hand. After manipulating the data, I used LOAD DATA INFILE to update the data. In another case, I exported with SELECT and stdout redirection, manipulated the data and then added a script which just created a file with a bunch of UPDATE ... WHERE statements containing the corresponding data. Then I ran the resulting .sql against my database. Is the latter maybe the best option in this case?
Exporting and importing is indeed sort of clunky in MySQL.
One problem is that it introduces a race condition. What if you export data to work on it, then someone modifies the data in the database, then you import your modified data, overwriting your friend's recent changes?
If you say, "no one is allowed to change data until you re-import the data," that could cause an unacceptably long time where clients are blocked, if the table is large.
The trend is that people want the database to minimize downtime, and ideally to have no downtime at all. Advancements in database tools are generally made with this priority in mind, not so much to accommodate your workflow of taking the data out of MySQL for transformations.
Also, what if the database is large enough that the exported data itself becomes a problem? Where do you store a 500GB TSV file? Does pandas even work on a file that large?
What most people do is modify data while it remains in the database. They use in-place UPDATE statements to modify data. If they can't do this in one pass (there's a practical limit of 4GB for a binary log event, for example), then they UPDATE more modest-size subsets of rows, looping until they have transformed the data on all rows of a given table.
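A sketch of that batching pattern driven from the shell, assuming an auto-increment integer key id and with UPPER() standing in for whatever the real transformation is:
batch=10000
max_id=$(mysql -N Export_test -e "SELECT COALESCE(MAX(id), 0) FROM Sample;")
start=0
while [ "$start" -lt "$max_id" ]; do
    mysql Export_test -e "UPDATE Sample SET text_data = UPPER(text_data) WHERE id > $start AND id <= $start + $batch;"
    start=$((start + batch))
done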

How do I import/handle large text files for MS SQL?

I have a 1.7GB txt file (about 1.5 million rows) that is apparently formatted in some way into columns and rows, though I don't know the delimiter. I will need to be able to import this data into MySQL and MS SQL databases to run queries on it.
I can't even open it in notepad to see a sample of the data.
For future reference, how does one handle and manipulate very large data files? What file format is best? To my knowledge Excel and CSV do not support unlimited numbers of rows.
You can use bcp to import it, as below:
bcp yourtable in C:\Data\yourfile.txt -c -t, -S localhost -T
Since you know the column names from MySQL, you can create a table with that structure beforehand in SQL Server.
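To figure out the delimiter before loading anything, a quick sketch is to peek at the first lines with head (GNU coreutils, available on Windows via GnuWin32 or Git Bash) instead of opening the whole 1.7GB file in an editor; the file name below is a placeholder:
head -n 5 bigfile.txt                        # show the first few rows
head -n 1 bigfile.txt | tr -cd '\t' | wc -c  # count tabs in the first row
head -n 1 bigfile.txt | tr -cd ',' | wc -c   # count commas in the first row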

Interbase export to CSV

We are using Interbase XE3 (2013) with Delphi XE.
We need to export tables as CSV-File as they are imported into MsAccess for further controlling purposes.
For exporting we use the function of IBExpert and/or by our own program (Export as CSV-File via TClientDataSet from cxGrid).
The Problem:
Unfortunately the export is way too limited and we have no solution for exporting the whole table (in this case 400k rows, approx. 80 columns, and it is increasing every month).
I've read about FBExport, but didn't test it yet as I do not want to risk problems with our database. Did anyone test this tool yet?
Suitable Solution (to be found):
We need a way to export whole tables from Interbase XE3 into a CSV-File without a limitation in size/column/rows (that's why I think CSV is the way to go as there is no overhead). Also I would prefer a solution that can be executed via batch-file without the need for a person in front of a computer.
Kind regards
I've managed to answer this question thanks to Arioch.
I have 2 parameters, "name of db" and "rows to fetch" (+optional divider).
My program determines the number of rows in a table (just Count(ID)) and divides it into pieces of "rows to fetch" rows each (as I ran out of memory before).
The export file is created at the beginning, with the column names I got from my ibcquery as the first line. Also, the maximum width of my multidimensional array is set by the column count of the table; the length is the "rows to fetch".
Statement is like "SELECT * FROM TABLENAME ROWS X TO Y".
This is executed for every divided part of the table and written to my array. After filling the array the query gets closed and the data is appended to my CSV-File. I free the array and the next part gets loaded until the whole table is written to my file.
So the next limitation would be the file size, I think.
Thanks for your help!

how to FAST import a giant sql script for mysql?

Currently I have a situation where I need to import a giant sql script into mysql. The sql script content is mainly INSERT operations. But there are so many records in there, and the file size is around 80GB.
The machine has 8 cpus, 20GB mem. I have done something like:
mysql -h [*host_ip_address*] -u *username* -px xxxxxxx -D *databaseName* < giant.sql
But the whole process takes several days, which is quite long. Are there any other options for importing the sql file into the database?
Thanks so much.
I suggest you try LOAD DATA INFILE. It is extremely fast. I've not used it for loading to a remote server, but there is the mysqlimport utility. See a comparison of different approaches: https://dev.mysql.com/doc/refman/5.5/en/insert-speed.html.
Also, you need to convert your sql script to a format suitable for the LOAD DATA INFILE clause.
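As a sketch of the mysqlimport route, assuming the INSERTs have already been converted to a tab-separated file whose base name matches the target table (all names below are placeholders; tab is mysqlimport's default field separator):
mysqlimport --local --host=host_ip_address --user=username -p databaseName /path/to/mytable.tsv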
You can break the sql file into several files (on the basis of tables) using a shell script, and then prepare a shell script to import the files one by one. This would speed up the insert compared to doing it in one go.
The reason is that inserted records occupy space in memory for a single process and are not removed. You can see that when you are importing a script, after 5 hours the query execution speed has become slow.
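A rough sketch of that split-and-loop approach, assuming every statement in giant.sql fits on a single line (otherwise split would cut a statement in half) and using placeholder connection details:
split -l 100000 giant.sql chunk_
for f in chunk_*; do
    echo "importing $f"
    mysql -h host_ip_address -u username -p'xxxxxxx' databaseName < "$f"
done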
Thanks for all your guys help.
I have taken some of your advice and done some comparisons, and now it is time to post the results. The target was a single 15GB sql script.
Overall, I tried:
importing data as a single sql script with index; (Takes days; finally I killed it. DO NOT TRY THIS YOURSELF, you will be pissed off.)
importing data as a single sql script without index; (Same as above)
importing data as split sql scripts with index (Taking the single sql as an example, I split the big file into small chunks of around 41MB each. Each chunk takes around 2m19.586s, total around ...);
importing data as split sql scripts without index; (Each chunk takes 2m9.326s.)
(Unfortunately I did not try the Load Data method for this dataset)
Conclusion:
If you do not want to use the Load Data method when you have to import a giant sql file into mysql, it is better to:
Divide into small scripts;
Remove the index
You can add the index back after importing. Cheers
Thanks #btilly #Hitesh Mundra
Put the following commands at the head of the giant.sql file:
SET AUTOCOMMIT = 0;
SET FOREIGN_KEY_CHECKS=0;
and the following at the end:
SET FOREIGN_KEY_CHECKS = 1;
COMMIT;
SET AUTOCOMMIT = 1;

Speed up MySQL?

I have a working project, but I want to tune it a bit.
I use a batch: C:\xampp\mysql\bin\mysql --force -u user -ppass database1 < C:\some\import.sql
The file import.sql is a file that contains insert statements. Each file is about 30MB in size, and it gets exponentially bigger. Here is a data example (very basic):
ID | DATE   | TEST
1  | 1-1-14 | Y
2  | 1-2-14 | Y
For each date, a few entries get added.
Yesterday 49999 Lines in the file
Today 50002 Lines in the file
So I only really need 3 lines of that file! And my batch "errors" on 49999 lines, saying that there is a duplicate line.
Is there any way to speed this up?
It's not possible to give advice on speeding up SQL without knowing what the SQL says. There are all kinds of complex factors to take into account. You could consider editing your question to show more details.
The fastest way to bulk-load data into a MySQL server is to use the LOAD DATA INFILE command. That command reads data from a flat file like a CSV or tab-separated value file directly into a table. It doesn't work with files of SQL commands.
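If the daily file were exported as tab-separated data rather than INSERT statements, the load could use IGNORE so that the 49999 rows already present are silently skipped instead of raising duplicate-key errors. A sketch, with the column names taken from the example above and everything else hypothetical (the target table needs a unique key on ID for IGNORE to have that effect, and LOCAL requires local_infile to be enabled):
LOAD DATA LOCAL INFILE 'C:/some/import.tsv'
IGNORE INTO TABLE mytable
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(ID, `DATE`, TEST);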