I'm using a database with MySQL 5.7, and sometimes data needs to be updated using a mixture of scripts and manual editing. Because the people working with the database are usually not familiar with SQL, I'd like to export the data as a TSV, which could then be manipulated (for example with Python's pandas module) and imported back. I assume the standard way would be to connect to the database directly, but I think using TSVs has some upsides in this situation. I've been reading the MySQL docs and some Stack Overflow questions to find the best way to do this. I've found a couple of solutions, but they are all somewhat inconvenient. I will list them below and explain my problems with them.
My question is: did I miss something, for example some helpful SQL commands or CLI options to help with this? Or are the solutions I found already the best when importing/exporting TSVs?
My example database looks like this:
Database: Export_test
Table: Sample
Field      Type       Null  Key
id         int(11)    NO    PRI
text_data  text       NO
optional   int(11)    YES
time       timestamp  NO
Example data:
INSERT INTO `Sample` VALUES (1,'first line\\\nsecond line',NULL,'2022-02-16 20:17:38');
The data contains an escaped newline, which caused a lot of problems for me when exporting.
Table: Reference
Field        Type     Null  Key
id           int(11)  NO    PRI
foreign_key  int(11)  NO    MUL
Example data:
INSERT INTO `Reference` VALUES (1,1);
foreign_key is referencing a Sample.id.
Note about encoding: As a caveat for people trying to do the same thing: if you want to export/import data, make sure that character sets and collations are set up correctly for connections. This caused me some headaches, because although the data itself is utf8mb4, the client, server and connection character sets were latin1, which caused data loss in some instances.
Export
So, for exporting, I found basically three solutions, and they all behave somewhat differently:
A: SELECT stdout redirection
mysql Export_test -e "SELECT * FROM Sample;" > out.tsv
Output:
id text_data optional time
1 first line\\\nsecond line NULL 2022-02-16 21:26:13
Pros:
headers are added, which makes it easy to use with external programs
formatting works as intended
Cons:
NULL is used for null values; when importing, \N is required instead; as far as I know, this can't be configured for exports
Workaround: replace NULL values when editing the data
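The workaround for variant A can be scripted instead of done by hand. Below is a minimal sketch, assuming the file came from the mysql client in batch mode, where tabs and newlines inside values are already escaped, so splitting each line on tabs is safe. The function name is made up for illustration. One caveat: a text column that contains the literal string NULL cannot be told apart from a real NULL in this output.

```python
def nulls_to_backslash_n(line):
    """Replace fields that are exactly NULL with the \\N marker LOAD DATA expects."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join("\\N" if field == "NULL" else field for field in fields) + "\n"


# The data row from the example output above (tab-separated).
row = "1\tfirst line\\\\\\nsecond line\tNULL\t2022-02-16 21:26:13\n"
converted = nulls_to_backslash_n(row)
```

Only fields that consist of exactly NULL are touched, so a value such as NULLABLE is left alone.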
B: SELECT INTO OUTFILE
mysql Export_test -e "SELECT * FROM Sample INTO OUTFILE '/tmp/out.tsv';"
Output:
1 first line\\\
second line \N 2022-02-16 21:26:13
Pros:
\N is used for null data
Cons:
escaped linebreaks are not handled correctly
headers are missing
file writing permission issues
Workaround: fix linebreaks manually; add headers by hand or supply them in the script; use /tmp/ as output directory
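The linebreak fix for variant B can be automated as well. As far as I can tell, SELECT INTO OUTFILE with default options writes a newline inside a value as a backslash followed by a real newline, so a physical line that ends in an odd number of backslashes continues on the next line. The sketch below (hypothetical function name) merges such lines and rewrites the escaped newline as \n, which gives the same one-row-per-line shape as variant A.

```python
def join_escaped_rows(lines):
    """Merge physical lines of a SELECT ... INTO OUTFILE dump into one line per row."""
    rows = []
    pending = ""
    for line in lines:
        line = line.rstrip("\n")
        trailing = len(line) - len(line.rstrip("\\"))
        if trailing % 2 == 1:
            # Odd count: the last backslash escapes the newline itself.
            # Drop it and write the two-character sequence \n instead.
            pending += line[:-1] + "\\n"
        else:
            rows.append(pending + line)
            pending = ""
    if pending:
        rows.append(pending)
    return rows


# The two physical lines from the example output above.
dump_lines = [
    "1\tfirst line\\\\\\\n",
    "second line\t\\N\t2022-02-16 21:26:13\n",
]
rows = join_escaped_rows(dump_lines)
```

A header line still has to be added separately, since the dump itself carries no column names.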
C: mysqldump with --tab (performs SELECT INTO OUTFILE behind the scenes)
mysqldump --tab='/tmp/' --skip-tz-utc Export_test Sample
Output, pros and cons: same as export variant B
Something that should be noted: the output is only the same as B, if --skip-tz-utc is used; otherwise, timestamps will be converted to UTC, and will be off after importing the data.
Import
Something I didn't realize at first is that it's impossible to merely update data directly with LOAD DATA INFILE or mysqlimport, although that's something many GUI tools appear to be doing and other people have attempted. For me as a beginner, this wasn't immediately clear from the MySQL docs. A workaround appears to be creating an empty table, importing the data there and then updating the actual table of interest via a join. I also thought one could update individual columns with this, which again is not possible. If there are other ways to achieve this, I would really like to know.
As far as I could tell, there are two options, which do pretty much the same thing:
LOAD DATA INFILE:
mysql Export_test -e "SET FOREIGN_KEY_CHECKS = 0; LOAD DATA INFILE '/tmp/Sample.tsv' REPLACE INTO TABLE Sample IGNORE 1 LINES; SET FOREIGN_KEY_CHECKS = 1;"
mysqlimport (performs LOAD DATA INFILE behind the scenes):
mysqlimport --replace Export_test /tmp/Sample.tsv
Notice: if there are foreign key constraints like in this example, SET FOREIGN_KEY_CHECKS = 0; needs to be performed (as far as I can tell, mysqlimport can't be directly used in these cases). Also, IGNORE 1 LINES or --ignore-lines can be used to skip the first line if the input TSV contains a header. For mysqlimport, the name of the input file without extension must be the name of the table. Again, file reading permissions can be an issue, and /tmp/ is used to avoid that.
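The staging-table workaround mentioned earlier (load into an empty table, then update via a join) could look roughly like the sketch below. It only builds the statements; the helper name and the _staging suffix are assumptions, and table names and the path are inserted without escaping, so it is meant for trusted input only. Because a temporary table lives in one session, all statements would have to run over the same connection.

```python
def staging_update_statements(table, key, columns, tsv_path):
    """Build MySQL statements that update `columns` of `table` from a TSV file."""
    staging = f"{table}_staging"
    assignments = ", ".join(f"t.`{c}` = s.`{c}`" for c in columns)
    return [
        # Same columns as the real table, but empty and private to this session.
        f"CREATE TEMPORARY TABLE `{staging}` LIKE `{table}`",
        f"LOAD DATA INFILE '{tsv_path}' INTO TABLE `{staging}` IGNORE 1 LINES",
        # Only the listed columns are overwritten; other columns keep their values.
        f"UPDATE `{table}` t JOIN `{staging}` s ON t.`{key}` = s.`{key}` SET {assignments}",
        f"DROP TEMPORARY TABLE `{staging}`",
    ]


statements = staging_update_statements("Sample", "id", ["text_data"], "/tmp/Sample.tsv")
```

Since rows in Sample are updated in place rather than replaced, this variant should not need FOREIGN_KEY_CHECKS to be switched off, though I have only reasoned about that, not measured it on a large table.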
Are there ways to make this process more convenient? Like, are there some options I can use to avoid the manual workarounds, or are there ways to use TSV importing to UPDATE entries without creating a temporary table?
What I ended up doing was using SELECT INTO OUTFILE for exporting, adding a header manually and fixing the malformed lines by hand. After manipulating the data, I used LOAD DATA INFILE to update the data. In another case, I exported with SELECT and stdout redirection, manipulated the data and then wrote a script that just created a file with a bunch of UPDATE ... WHERE statements with the corresponding data. Then I ran the resulting .sql in my database. Is the latter maybe the best option in this case?
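For the UPDATE ... WHERE approach, one way to avoid hand-written quoting is to produce parameterized statements and let the database driver do the escaping. This is a sketch under assumptions: the input is a TSV with a header as produced by variant A, the helper names are invented, and the %s placeholder style is the one used by drivers such as PyMySQL or mysqlclient.

```python
import re

_ESCAPES = {"n": "\n", "t": "\t", "0": "\0", "\\": "\\"}


def unescape(field):
    """Undo mysql batch-mode escapes; the bare word NULL becomes None."""
    if field == "NULL":
        return None
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), field)


def update_params(lines, table, key):
    """Yield (sql, params) pairs, one UPDATE per data row of the TSV."""
    header = lines[0].rstrip("\n").split("\t")
    others = [c for c in header if c != key]
    sql = "UPDATE `{}` SET {} WHERE `{}` = %s".format(
        table, ", ".join(f"`{c}` = %s" for c in others), key
    )
    for line in lines[1:]:
        values = [unescape(f) for f in line.rstrip("\n").split("\t")]
        row = dict(zip(header, values))
        yield sql, tuple(row[c] for c in others) + (row[key],)


edited = [
    "id\ttext_data\toptional\ttime\n",
    "1\tfirst line\\\\\\nsecond line\tNULL\t2022-02-16 21:26:13\n",
]
statements = list(update_params(edited, "Sample", "id"))
```

Each pair can then be passed to a cursor's execute(sql, params) call inside one transaction.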
Exporting and importing is indeed sort of clunky in MySQL.
One problem is that it introduces a race condition. What if you export data to work on it, then someone modifies the data in the database, then you import your modified data, overwriting your friend's recent changes?
If you say, "no one is allowed to change data until you re-import the data," that could cause an unacceptably long time where clients are blocked, if the table is large.
The trend is that people want the database to minimize downtime, and ideally to have no downtime at all. Advancements in database tools are generally made with this priority in mind, not so much to accommodate your workflow of taking the data out of MySQL for transformations.
Also, what if the database is large enough that the exported data itself becomes a problem? Where do you store a 500GB TSV file? Does pandas even work on such a large file?
What most people do is modify data while it remains in the database. They use in-place UPDATE statements to modify data. If they can't do this in one pass (there's a practical limit of 4GB for a binary log event, for example), then they UPDATE more modest-size subsets of rows, looping until they have transformed the data on all rows of a given table.
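The looping over modest-size subsets can be sketched as follows, assuming the table has an integer primary key to slice on. The helper name, the batch size and the example UPDATE are illustrative assumptions, not a recommendation for specific values.

```python
def id_batches(min_id, max_id, batch_size):
    """Yield inclusive (low, high) id ranges that together cover min_id..max_id."""
    low = min_id
    while low <= max_id:
        high = min(low + batch_size - 1, max_id)
        yield low, high
        low = high + 1


# Hypothetical transformation, run once per range with (low, high) as parameters.
BATCH_SQL = (
    "UPDATE Sample SET optional = 0 "
    "WHERE optional IS NULL AND id BETWEEN %s AND %s"
)

batches = list(id_batches(1, 2500, 1000))
```

Committing after each range keeps every transaction, and every binary log event, small.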
I've been searching for a quick way to do this after my first few thoughts have failed me, but I haven't found anything.
My Issue
I'm importing raw client data into an Access database where the flat file they provide is parsed and converted into a standardized format for our organization. I do this for all of our clients, but this particular client's software gives us a file that puts "(NULL)" in every field that should be NULL, lol. As a result, I have a ton of strings rather than null fields!
My goal is to do a data cleanse of the entire TABLE, rather than perform the cleanse at the FIELD level (as I do in my temporary solution below).
Data Cleanse
Temporary Solution:
I can't add those strings to our data warehouse, so for now, I just have a query with an IIF statement check that replaces "(NULL)" with "" for each field (which took a while to set up, since the client file has roughly 96 fields). This works. However, we work with hundreds of clients, so I'd like a scalable solution that doesn't require many changes if another client has a similar file; not to mention that if this client changes something in their file, I might have to redo my field-specific statements.
Long-term Solution:
My first thought was an UPDATE query. I was hoping I could do something like:
UPDATE [ImportedRaw_T]
SET [ImportedRaw_T].* = ""
WHERE ((([ImportedRaw_T].* = "(NULL)"));
This would be easily scalable, since for further clients I'd only need to change the table name and replace "(NULL)" with their particular default. Unfortunately, you can't use SELECT * with an update query.
Can anyone think of a workaround to the SELECT * issue for the update query, or have a better solution for cleansing an entire table, rather than doing the cleanse at the field level?
SIDE NOTES
This conversion is 100% automated currently (Access is called via a watch folder batch), so anything requiring manual data manipulation / human intervention is out.
I've tried using a batch script to just cleanse the data in the .txt file before importing to Access - however, this caused an issue with the fixed-width format of the .txt, which has caused even larger issues with the automatic import of the file to Access. So I'd prefer to do this in Access if possible.
Any thoughts and suggestions are greatly appreciated. Thanks!
Unfortunately, it's impossible to implement this in SQL using wildcards instead of column names; there is no such syntax.
I would suggest a VBA solution, where you loop through all the table's fields and, if a field's data type is string, generate and execute an SQL UPDATE command for that field.
Also use Null instead of "", if you really need Nulls in the field instead of empty strings, they may work differently in calculations.
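To illustrate the per-field loop, here is a sketch of the statement generation only, written in Python rather than VBA. It assumes the list of text fields has already been read from the table definition; the function name and the field names in the usage line are made up.

```python
def cleanup_statements(table, text_fields, placeholder="(NULL)"):
    """Build one Access UPDATE per text field that turns the placeholder into Null."""
    return [
        f'UPDATE [{table}] SET [{field}] = Null WHERE [{field}] = "{placeholder}";'
        for field in text_fields
    ]


# Hypothetical field names; the real ones would come from the table's field list.
statements = cleanup_statements("ImportedRaw_T", ["ClientName", "City"])
```

For another client, only the table name and the placeholder string would change.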
From the docs, it states:
The CSV storage engine stores data in text files using comma-separated
values format.
What are the advantages of this? Here are some I can think of:
You can edit the CSV files using simple text editor (however, you can export data easily using SELECT INTO OUTFILE)
Can be easily imported into Spreadsheet programs
Lightweight and maybe better performance (wild guess)
What are some disadvantages?
No indexing
Cannot be partitioned
No transactions
Cannot have NULL values
Granted this (non-exhaustive) list of advantages and disadvantages, in what practical scenarios should I consider using the CSV storage engine over others?
I seldom use the CSV storage engine. One scenario I have found it useful, however, is for bulk data imports.
Create a table with columns matching my input CSV file.
Outside of mysql, just using a shell prompt, mv the CSV file into the MySQL data directory, overwriting the .csv file that belongs to the table I just created.
ALTER TABLE mytable ENGINE=InnoDB
Voilà! One-step import of a huge CSV data file using DDL instead of INSERT or LOAD DATA.
Granted, it's less flexible than INSERT or LOAD DATA, because you can't do NULLs or custom overrides of individual columns, or any "replace" or "ignore" features for handling duplicate values. But if you have an input file that is exactly what you want to import, it could make the import very easy.
This is a tad bit hacky, but as of MySQL 8, assuming you know the data structure beforehand and have permissions in the CSV-based schema directory, you can create the table definition in MySQL and then overwrite the generated CSV table file in the data directory with a symlink to the data file:
mysql --execute="CREATE TABLE TEST.CSV_TEST ( test_col VARCHAR(255) ) ENGINE=CSV;"
ln -sf /path/to/data.file /var/lib/mysql/TEST/CSV_TEST.CSV
An advantage here is that this completely obviates the need to run import operations (via LOAD DATA INFILE, etc.), as it allows MySQL to read directly from the symlinked file as if it were the table file.
Drawbacks beyond those inherent to the CSV engine:
table will contain header row if there is one (you'd need to filter it out from read operations)
table metadata in INFORMATION_SCHEMA will not update using this method, will just show the CREATE_TIME for which the initial DDL is run
Note this method is obviously more geared toward READ operations, though update/insert operations could be conducted on the command line using SELECT ... INTO OUTFILE and then copying onto/appending the source file.
The problem is:
I've got a SQLite database which is constantly being updated though a proprietary application.
I'm building an application which uses MySQL and the database design is very different from the one of SQLite.
I then have to copy data from SQLite to MySQL but it should be done very carefully as not everything should be moved, tables and fields have different names and sometimes data from one table goes to two tables (or the opposite).
In short, SQLite should behave as a client to MySQL inserting what is new and updating the old in an automated way. It doesn't need to be updating in real time; every X hours would be enough.
A google search gave me this:
http://migratedb.sourceforge.net/
And asking a friend I got information about the Multisource plugin (Squirrel SQL) in this page:
http://squirrel-sql.sourceforge.net/index.php?page=plugins
I would like to know if there is a better way to solve the problem or if I will have to make a custom script myself.
Thank you!
I recommend a custom script for this:
If it's not a one-to-one conversion between the tables and fields, tools might not help there. In your question, you've said:
...and sometimes data from one table goes to two tables (or the opposite).
If you only want the differences, then you'll need to build the logic for that unless every record in the SQLite db has timestamps.
Are you going to be updating the MySQL db at all? If not, are you okay to completely delete the MySQL db and refresh it every X hours with all the data from SQLite?
Also, if you are comfortable with a scripting language (like PHP, Python, Perl, Ruby, etc.), they have APIs for both SQLite and MySQL; it would be easy enough to build your own script, which you can control and customise more easily based on program logic. Especially if you want to run "conversions" between the data from one to the other and not just simple mapping.
I hope I understand you correctly: you want to periodically flush the data stored in a SQLite DB to a MySQL DB. Right?
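A custom script along these lines might start out like the sketch below. Everything schema-related is an assumption made for illustration: a SQLite table person with an updated_at timestamp, and two MySQL target tables users and contacts that each have a unique key on the id column. It only produces the statements and parameters; sending them to MySQL through a driver is left out.

```python
import sqlite3

USERS_SQL = (
    "INSERT INTO users (id, name) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE name = VALUES(name)"
)
CONTACTS_SQL = (
    "INSERT INTO contacts (user_id, email) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE email = VALUES(email)"
)


def changes_since(conn, since):
    """Yield MySQL upserts for every SQLite row modified after `since`."""
    query = "SELECT id, full_name, email FROM person WHERE updated_at > ? ORDER BY id"
    for person_id, name, email in conn.execute(query, (since,)):
        # One source row feeds two target tables.
        yield USERS_SQL, (person_id, name)
        yield CONTACTS_SQL, (person_id, email)


# Stand-in for the proprietary application's database.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, full_name TEXT, email TEXT, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "ann@example.com", "2022-01-01 00:00:00"),
        (2, "Bob", "bob@example.com", "2022-03-01 00:00:00"),
    ],
)
pending = list(changes_since(source, "2022-02-01 00:00:00"))
```

Run from cron every X hours, the script would remember the timestamp of its last run and pass it in as since.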
So this is how i would do it.
Create a Cron, which starts the script every x minutes.
Export the data from SQLite into a CSV file.
Run LOAD DATA INFILE to import the CSV data into MySQL.
Code example for LOAD DATA INFILE
LOAD DATA INFILE 'PATH_TO_EXPORTED_CSV'
REPLACE INTO TABLE your_table
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(@value_column1, @unimportant_value, @value_column2, @unimportant_value, @unimportant_value, @value_column3)
SET diff_mysql_column1 = @value_column1,
    diff_mysql_column2 = @value_column2,
    diff_mysql_column3 = @value_column3;
You can run a statement like this for as many tables as you want, and you can rename the user variables such as @value_column1.
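The export step from the list above could be done with a few lines of Python. This is a sketch, not a tested pipeline: the function name is hypothetical, and the delimiter, quoting and header row are chosen to match the LOAD DATA options shown (';' separator, '"' enclosure, one header line to skip). None values end up as empty strings and backslashes in the data are not escaped here, so data containing either would need extra care.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write all rows of a SQLite table to `out` as ';'-separated, fully quoted CSV."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(
        out, delimiter=";", quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writerow([column[0] for column in cursor.description])  # header row
    writer.writerows(cursor)


# Small in-memory stand-in for the real SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'a;b')")
buffer = io.StringIO()
export_table(conn, "items", buffer)
```

In the cron job, out would be a file opened with newline="" at the path that LOAD DATA INFILE reads from.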
Im in a hurry. so thats it for now. ask if something is unclear.
Greets Michael
I have a text file full of values like this:
The first line is a list of column names like this:
col_name_1, col_name_2, col_name_3 ......(600 columns)
and all the following lines have values like this:
1101,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1101,1,3.86,65,0.46418,65,0.57151...
What is the best way to import this into mysql?
Specifically how to come up with the proper CREATE TABLE command so that the data will load itself properly? What is the best generic data type which would take in all the above values like 1101 or 3.86 or 0.57151. I am not worried about the table being inefficient in terms of storage as I need this for a one time usage.
I have tried some of the suggestions in other related questions like using Phpmyadmin (it crashes I am guessing due to the large amount of data)
Please help!
Data in CSV files is not normalized; those 600 columns may be spread across a couple of related tables. This is the recommended way of treating those data. You can then use fgetcsv() to read CSV files line-by-line in PHP.
To make MySQL process the CSV, you can create a 600 column table (I think) and issue a LOAD DATA LOCAL INFILE statement (or perhaps use mysqlimport, not sure about that).
The most generic data type would have to be VARCHAR or TEXT for bigger values, but of course you would lose semantics when used on numbers, dates, etc.
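Writing a 600-column CREATE TABLE by hand is tedious, so it may be easier to generate it from the header line. A minimal sketch, assuming the first line holds comma-separated column names and using TEXT as the generic type suggested above. The function and table names are made up, and whether a table this wide fits the storage engine's row-size limits is something to check on your own server.

```python
def create_table_sql(table, header_line, column_type="TEXT"):
    """Build a CREATE TABLE statement with one generic column per header name."""
    names = [name.strip() for name in header_line.strip().split(",")]
    columns = ",\n  ".join(
        "`{}` {}".format(name.replace("`", "``"), column_type) for name in names
    )
    return f"CREATE TABLE `{table}` (\n  {columns}\n)"


ddl = create_table_sql("wide_import", "col_name_1, col_name_2, col_name_3")
```

After running the generated statement, LOAD DATA LOCAL INFILE with FIELDS TERMINATED BY ',' and IGNORE 1 LINES can fill the table.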
I noticed that you included the phpmyadmin tag.
phpMyAdmin can handle this out of the box. It will decide "magically" which type to make each column, and will CREATE the table for you, as well as INSERT all the data. There is no need to worry about LOAD DATA INFILE, though that method can be safer if you want to know exactly what's going on without relying on phpMyAdmin's magic tooling.
Try convertcsvtomysql, just upload your csv file and then you can download and/or copy the mysql statement to create the table and insert rows.
I have some CSV data files that I want to import into MySQL. I would like to do the insert in a shell script, so that it can be automated. However, I am a bit wary of having the username and password in clear text in the script.
I have the following questions:
I am uncomfortable with the idea of a uname/pwd in clear text in the script (is there any way around this, or am I being too paranoid?). Maybe I can set up a user with only the INSERT privilege for the table to be inserted into?
The database table (into which the raw data is imported) has a unique key based on the table columns. It is also possible that there may be duplicates in the data that I am trying to import. Rather than mySQL barfing (i.e. the entire insertion fails), I would instead want to be able to tell mySQL to EITHER
(a) UPDATE the row with the new data OR
(b) IGNORE the duplicate row.
Whichever setting I choose would be for the ENTIRE import and not on a row by row basis. Are there any flags etc I can pass to mysql in order for it to behave like (a) OR (b) above
Can anyone suggest a starting point on how to write such a (bourne) shell script?
You should read about mysqlimport, which is a command-line tool provided with MySQL. This tool is the fastest way to bulk-load CSV data.
The tool has two options, --replace and --ignore to handle duplicate key conflicts.
Regarding security and avoiding putting the password in plain text in the script, you can also use the MYSQL_PWD environment variable or the .my.cnf file (make sure that file is mode 400 or 600). See End-User Guidelines for Password Security.
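Put together, an automated import could assemble the mysqlimport call as in the sketch below. The helper name, database name and paths are placeholders; the option file is assumed to contain a [client] section with user and password and to be readable only by its owner. The command is only built here, not executed.

```python
import os


def mysqlimport_command(database, csv_path, on_duplicate="replace", option_file="~/.my.cnf"):
    """Build the argument list for a mysqlimport run that reads credentials from a file."""
    if on_duplicate not in ("replace", "ignore"):
        raise ValueError("on_duplicate must be 'replace' or 'ignore'")
    return [
        "mysqlimport",
        # Must come first; keeps the password out of the script and the process list.
        "--defaults-extra-file=" + os.path.expanduser(option_file),
        "--" + on_duplicate,  # applies to the whole import, not row by row
        "--fields-terminated-by=,",
        database,
        csv_path,  # file name without extension must match the table name
    ]


command = mysqlimport_command("mydb", "/tmp/raw_data.csv", "ignore", "/home/me/.my.cnf")
```

The list can be handed to subprocess.run(command, check=True), or the same options can be written directly in a shell script.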