MySQL bulk insert from CSV data files - mysql

I have some CSV data files that I want to import into mySQL. I would like to do the insert in a shell script, so that it can be automated. However, I am a bit weary of having the username and password in clear text in the script
I have the following questions:
I am uncomfortable with the idea of a uname/pwd in clear text in the script (is there anyway around this, or am I being too paranoid)?. Maybe I can set up a user with only INSERT privelege for the table to be inserted?
The database table (into which the raw data is imported) has a unique key based on the table columns. It is also possible that there may be duplicates in the data that I am trying to import. Rather than mySQL barfing (i.e. the entire insertion fails), I would instead want to be able to tell mySQL to EITHER
(a) UPDATE the row with the new data OR
(b) IGNORE the duplicate row.
Whichever setting I choose would be for the ENTIRE import and not on a row by row basis. Are there any flags etc I can pass to mysql in order for it to behave like (a) OR (b) above
Can anyone suggest a starting point on how to write such a (bourne) shell script?

You should read about mysqlimport, which is a command-line tool provided with MySQL. This tool is the fastest way to bulk-load CSV data.
The tool has two options, --replace and --ignore to handle duplicate key conflicts.
Regarding security and avoiding putting the password in plain text in the script, you can also use the MYSQL_PWD environment variable or the .my.cnf file (make sure that file is mode 400 or 600). See End-User Guidelines for Password Security.

Related

Importing and exporting TSVs with MySQL

I'm using a database with MySQL 5.7, and sometimes, data needs to be updated using a mixture of scripts and manual editing. Because people working with the database are usually not familiar with SQL, I'd like to export the data as a TSV, which then could be manipulated (for example with Python's pandas module) and then be imported back. I assume the standard way would be to directly connect to the database, but using TSVs has some upsides in this situation, I think. I've been reading the MySQL docs and some stackoverflow questions to find the best way to do this. I've found a couple of solutions, however, they all are somewhat inconvenient. I will list them below and explain my problems with them.
My question is: did I miss something, for example some helpful SQL commands or CLI options to help with this? Or are the solutions I found already the best when importing/exporting TSVs?
My example database looks like this:
Database: Export_test
Table: Sample
Field
Type
Null
Key
id
int(11)
NO
PRI
text_data
text
NO
optional
int(11)
YES
time
timestamp
NO
Example data:
INSERT INTO `Sample` VALUES (1,'first line\\\nsecond line',NULL,'2022-02-16 20:17:38');
The data contains an escaped newline, which caused a lot of problems for me when exporting.
Table: Reference
Field
Type
Null
Key
id
int(11)
NO
PRI
foreign_key
int(11)
NO
MUL
Example data:
INSERT INTO `Reference` VALUES (1,1);
foreign_key is referencing a Sample.id.
Note about encoding: As a caveat for people trying to do the same thing: If you want to export/import data, make sure that characters sets and collations are set up correctly for connections. This caused me some headache, because although the data itself is utf8mb4, the client, server and connection character sets were latin1, which caused some loss of data in some instances.
Export
So, for exporting, I found basically three solutions, and they all behave somewhat differently:
A: SELECT stdout redirection
mysql Export_test -e "SELECT * FROM Sample;" > out.tsv
Output:
id text_data optional time
1 first line\\\nsecond line NULL 2022-02-16 21:26:13
Pros:
headers are added, which makes it easy to use with external programs
formatting works as intended
Cons:
NULL is used for null values; when importing, \N is required instead; as far as I know, this can't be configured for exports
Workaround: replace NULL values when editing the data
B: SELECT INTO OUTFILE
mysql Export_test -e "SELECT * FROM Sample INTO OUTFILE '/tmp/out.tsv';"
Output:
1 first line\\\
second line \N 2022-02-16 21:26:13
Pros:
\N is used for null data
Cons:
escaped linebreaks are not handled correctly
headers are missing
file writing permission issues
Workaround: fix linebreaks manually; add headers by hand or supply them in the script; use /tmp/ as output directory
C: mysqldump with --tab (performs SELECT INTO OUTFILE behind the scenes)
mysqldump --tab='/tmp/' --skip-tz-utc Export_test Sample
Output, pros and cons: same as export variant B
Something that should be noted: the output is only the same as B, if --skip-tz-utc is used; otherwise, timestamps will be converted to UTC, and will be off after importing the data.
Import
Something I didn't realize it first, is that it's impossible to merely update data directly with LOAD INTO or mysqlimport, although that's something many GUI tools appear to be doing and other people attempted. For me as an beginner, this wasn't immediately clear from the MySQL docs. A workaround appears to be creating an empty table, import the data there and then updating the actual table of interest via a join. I also thought one could update individual columns with this, which again is not possible. If there are some other ways to achieve this, I would really like to know.
As far as I could tell, there are two options, which do pretty much the same thing:
LOAD INTO:
mysql Export_test -e "SET FOREIGN_KEY_CHECKS = 0; LOAD DATA INFILE '/tmp/Sample.tsv' REPLACE INTO TABLE Sample IGNORE 1 LINES; SET FOREIGN_KEY_CHECKS = 1;"
mysqlimport (performs LOAD INTO behind the scenes):
mysqlimport --replace Export_test /tmp/Sample.tsv
Notice: if there are foreign key constraints like in this example, SET FOREIGN_KEY_CHECKS = 0; needs to be performed (as far as I can tell, mysqlimport can't be directly used in these cases). Also, IGNORE 1 LINES or --ignore-lines can be used to skip the first line if the input TSV contains a header. For mysqlimport, the name of the input file without extension must be the name of the table. Again, file reading permissions can be an issue, and /tmp/ is used to avoid that.
Are there ways to make this process more convenient? Like, are there some options I can use to avoid the manual workarounds, or are there ways to use TSV importing to UPDATE entries without creating a temporary table?
What I ended up doing was using LOAD INTO OUTFILE for exporting, added a header manually and also fixed the malformed lines by hand. After manipulating the data, I used LOAD DATA INTO to update the data. In another case, I exported with SELECT to stdout redirection, manipulated the data and then added a script, which just created a file with a bunch of UPDATE ... WHERE statements with the corresponding data. Then I ran the resulting .sql in my database. Is the latter maybe the best option in this case?
Exporting and importing is indeed sort of clunky in MySQL.
One problem is that it introduces a race condition. What if you export data to work on it, then someone modifies the data in the database, then you import your modified data, overwriting your friend's recent changes?
If you say, "no one is allowed to change data until you re-import the data," that could cause an unacceptably long time where clients are blocked, if the table is large.
The trend is that people want the database to minimize downtime, and ideally to have no downtime at all. Advancements in database tools are generally made with this priority in mind, not so much to accommodate your workflow of taking the data out of MySQL for transformations.
Also what if the database is large enough that the exported data causes a problem because where do you store a 500GB TSV file? Does pandas even work on such a large file?
What most people do is modify data while it remains in the database. They use in-place UPDATE statements to modify data. If they can't do this in one pass (there's a practical limit of 4GB for a binary log event, for example), then they UPDATE more modest-size subsets of rows, looping until they have transformed the data on all rows of a given table.

Erasing records from text file after importing it to MySQL database

I know how to import a text file into MySQL database by using the command
LOAD DATA LOCAL INFILE '/home/admin/Desktop/data.txt' INTO TABLE data
The above command will write the records of the file "data.txt" into the MySQL database table. My question is that I want to erase the records form the .txt file once it is stored in the database.
For Example: If there are 10 records and at current point of time 4 of them have been written into the database table, I require that in the data.txt file these 4 records get erased simultaneously. (In a way the text file acts as a "Queue".) How can I accomplish this? Can a java code be written? Or a scripting language is to be used?
Automating this is not too difficult, but it is also not trivial. You'll need something (a program, a script, ...) that can
Read the records from the original file,
Check if they were inserted, and, if they were not, copy them in another file
Rename or delete the original file, and rename the new file to replace the original one.
There might be better ways of achieving what you want to do, but, that's not something I can comment on without knowing your goal.

PHPMyAdmin put the most recent database

I'm wondering if there is any better way than going through the tables one by one adding the columns missing when some fields/tables needs to be added because of the most recent changes in the app?
For example, I'm working at the localhost and when I finish doing the new version of my app, I will put all the files into my FTP and, sometimes, I have done, in my local database, changes and so it means that I also need to update my database at my server.
There's any better way to add/edit the columns/tables without changing the info? Some of the columns are also deleted, etc.
Hopefully you've thought your database design through so that making changes to the structure is a rare occurrence. If you're making regular changes to the number of columns or adding tables, it's likely a sign that you haven't normalized your database structure sufficiently.
Anyway, I'd script it as an SQL file that you deploy (which you can then run through phpMyAdmin or the command line or any other means you prefer to execute SQL queries). This has the added advantage of being something you can easily duplicate across your development and production databases, send to customers, and if you wish store in version control so you know when exactly you made the changes to the database.
This way, you'll end up with an SQL file that has a couple of statements like
ALTER TABLE `foo` ADD `new` INT NOT NULL ;
or something similar.
As for how you'd make the file, probably the easiest way is just copying and pasting the generated SQL statement from phpMyAdmin after modifying the table -- the SQL code used to make the change is shown near the top of the screen on the next page. You can copy and paste that to a new text file to create your SQL file. You may wish to add the first line
use `baz`;
using your database name instead of "baz". That way you don't have to specify on import which database the changes are meant for.
Hope this helps.

MySQL backup multi-client DB for single client

I am facing a problem for a task I have to do at work.
I have a MySQL database which holds the information of several clients of my company and I have to create a backup/restore procedure to backup and restore such information for any single client. To clarify, if my client A is losing his data, I have to be able to recover such data being sure I am not modifying the data of client B, C, ...
I am not a DB administrator, so I don't know if I can do this using standard mysql tools (such as mysqldump) or any other backup tools (such as Percona Xtrabackup).
To backup, my research (and my intuition) led my to this possibile solution:
create the restore insert statement using the insert-select syntax (http://dev.mysql.com/doc/refman/5.1/en/insert-select.html);
save this inserts into a sql file, either in proper order or allowing this script to temporarily disable the foreign key checks to meet foreign keys' constraint;
of course, I do this for all my clients on a daily base, using a file for each client (and day).
Then, in the case I have to restore the data for a specific client:
I delete all his data left;
I restore the correct data using his sql file I created during the backup.
This way I believe I may recover the right data of client A without touching the data of client B. Is my solution eventually working? Is there any better way to achieve the same result? Or do you need more information about my problem?
Please, forgive me if this question is not well-formed, but I am new here and this is my first question so I may be unprecise...thanks anyway for the help.
Note: we will also backup the entire database with mysqldump.
You can use the --where parameter, you could provide a condition like *client_id=N* . Of course I am making an assumption since you don't provide any information on your schema.
If you have a Star schema , then you could probably write a small script that backups all lookup tables (considering they are adequately small) by using this parameter --tables and use the --where condition for your client data table. For additional performance, perhaps you could partition the table by the client_id.

How to dump database from mysql with sensitive data removed or corrupted?

I am using mysql. Some of the tables contain sensitive data like user names, email addresses, etc. I want to dump the data but with these columns in the table removed or modified to some fake data. Is there any way to do it easily?
I'm using this approach:
Copy contents of sensitive tables to a temporary table.
Clear/encrypt the sensitive columns.
Provide --ignore-table arguments to mysqldump.exe to leave the original tables out.
It preserves foreign key contraints, and you can keep columns that are not sensitive.
The first two actions are contained in a stored procedure that I call before doing the dump. It looks something like this:
BEGIN
truncate table person_anonymous;
insert into person_anonymous select * from person;
update person_anonymous set Title=null, Initials=mid(md5(Initials),1,10), Midname=md5(Midname), Lastname=md5(Lastname), Comment=md5(Comment);
END
As you can see, I'm not clearing the contents of the fields. Instead, I keep a hash. That way, you can still see which rows have the same value, and between exports you can see if something changed or not, without anyone being able to read the actual values.
There is a tool called Jailer that is typically used to export a subset of a database. We use this at work to create a smaller test database from a production backup, with all sensitive data obfuscated.
The GUI is a bit crude, but Jailer is the best alternative I have found so far.
You can simply unselect the sensitive tables or columns and get a full copy of the rest. Jailer also supports obfuscating data during export - you could for instance md5 hash all user names or change all email addresses to user#example.org.
There is a tutorial to get you started.
ProxySQL is another approach.
Here is an article explaining how to obfuscate data with proxysql.
https://proxysql.com/blog/obfuscate-data-from-mysqldump