Selectively reading from CSV to MySQL

This is a two-part question.
The first part is: what architecture should I use for the following issue?
The second is the how, i.e., which commands should I use?
I have some log files I want to read into a database. The log files contain fields that are unnecessary (because they can be calculated from other fields).
Approach 1: Should I parse each line of the log file and insert it into the database?
Con: The log entries have to be unique, so I need to first do a SELECT, check if the LogItemID exists, and then INSERT if it doesn't. This seems like a high-overhead activity, and at some point this will be done on an hourly basis.
Approach 2: Or do I use LOAD DATA INFILE (can I even use that in PHP?) and just load the log file into a temporary table, then move the records into the permanent table?
Con: Even with this method, though, I will still have to go through the cycle of SELECT, then INSERT.
Approach 3: Or is there a better way? Is there a command to bulk-copy records from one table to another with selected fields? Will REPLACE INTO ... or INSERT ... ON DUPLICATE KEY UPDATE work (I don't want to UPDATE if the item exists, just ignore it) as long as LogItemID is set to UNIQUE? Either way, I need to throw the extraneous fields out. Which of these approaches is better? Not just easier, but from the standpoint of writing good, scalable code?
P.S. Unrelated, but part of the Architecture issue here is this...
If I have StartTime, EndTime and Interval (EndTime-StartTime), which should I keep - the first two or Interval? And Why?
Edit: To clarify why I did not want to store all three fields - storing a derivable value is of course a normalization issue and therefore not good practice. For audit reasons, perhaps I'll store them, perhaps in another table?
TIA

LOAD DATA INFILE is going to be a lot faster than running individual inserts.
You could load to a separate, temporary table, and then run an INSERT ... SELECT from the temporary table into your actual store. But it's not clear why you would need to do that. To "skip" some fields in the CSV, just assign those to dummy user-defined variables. There's no need to load those fields into the temporary table.
I'd define a UNIQUE key (constraint) and just use INSERT IGNORE; that will be a lot faster than running a separate SELECT, and faster than a REPLACE. (That fits your requirement: you don't have any need to update the existing row, you just want to "ignore" the new row.)
LOAD DATA INFILE 'my.csv'
IGNORE
INTO TABLE mytable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
( mycol
, @dummy2      -- CSV fields you don't want to keep
, @dummy3
, @mm_dd_yyyy  -- captured into a user variable, converted below
, somecol
)
SET mydatecol = STR_TO_DATE(@mm_dd_yyyy,'%m-%d-%Y');
If you have start, end and duration, go ahead and store all three. There's redundancy there; the main issues are performance and update anomalies. (If you update end, should you also update duration?) If I don't have a need to do updates, I'd just store all three. I could calculate duration from start_time and end_time, but having the column stored would allow me to add an index and get better performance on queries looking for durations less than 10 minutes, or whatever. Absent the column, I'd be forced to evaluate an expression for every row in the table, and that gets expensive on large sets.
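For concreteness, a minimal sketch of the staging-table variant; the table and column names (log_staging, log_entries, duration_sec) are hypothetical:
ALTER TABLE log_entries ADD UNIQUE KEY uq_log_item (LogItemID);

-- Copy only the columns you want, computing the derived field on the way,
-- and silently skip rows whose LogItemID already exists.
INSERT IGNORE INTO log_entries (LogItemID, StartTime, EndTime, duration_sec)
SELECT LogItemID,
       StartTime,
       EndTime,
       TIMESTAMPDIFF(SECOND, StartTime, EndTime)
FROM log_staging;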

You could use Perl to parse out the subset of CSV fields you want to load, then use the command 'uniq' to remove duplicates, then use LOAD DATA INFILE to load the result.
Typically, loading data into a temporary table and then traversing it is slower than preprocessing the data ahead of time. As for LogItemID, if you set it to UNIQUE, the inserts should fail when you load subsequent matching lines.
When it comes to deciding whether to store StartTime + Interval (more typically called Duration) or StartTime and EndTime, it really depends on how you plan on using the resulting database table. If you store the duration and are constantly computing the end time, it might be better to just store start/end. If you believe the duration will be commonly used, store it. Depending on how big the database is, you might decide to just store all three; one more field may not add much overhead.

Related

Maintain data integrity and consistency when performing sql batch insert/update with unique columns

I have an Excel file that, when downloaded, contains contents from the database. Each row is identified using an identifier called id_number. Users can add new rows to the file with a new unique id_number. When it is uploaded, for each Excel row:
When the id_number exist on the database, an update is performed on the database row.
When the id_number does not exist on the database, an insert is performed on the database row.
Other than the Excel file, data can be added or updated individually using a page called report.php. Users use this page if they only want to add a single record for an employee, for example.
Ideally, I would like to do an INSERT ... ON DUPLICATE KEY UPDATE for maximum performance. I might also put all of them in a transaction. However, I believe this overall process has some flaws:
Before any adds/updates, validation checks have to be done on all Excel rows against their corresponding database rows. The reason is that there are many unique columns in the table, so I'll have to do some SELECT statements to ensure the data is valid before performing any add/update. Is this efficient on tables with 500 rows and 69 columns? I could probably just get all the data, store it in a PHP array, and do the validation check on the array, but what happens if someone adds a new row (with an id_number of 5) through report.php, and the Excel file I uploaded also contains a row with an id_number of 5? That could destroy my validations, because I cannot be sure my data is up to date without performing a lot of SELECT statements.
Suppose the system is in the middle of a transaction adding/updating the data retrieved from the Excel file, and someone on report.php adds a row because all the validations have been satisfied (e.g. no duplicate id_numbers). Suppose at this point in time the next row to be added from the Excel file and the row that will be added by the user on report.php have the same id_number. What happens then? I don't have much knowledge of transactions; I think they at least prevent two queries from changing a row at the same time? Is that correct?
I don't really mind these kinds of situations that much. But some files have many rows and it might take a long time to process all of them.
One way I've thought of fixing this is: while the excel file upload is processing, I'll have to prevent users using report.php to modify the rows currently held by the excel file. Is this fine?
What could be the best way to fix these problems? I am using mysql.
If you really need to allow the user to generate their own unique ID, then you could lock the table in question while you're doing your validation and inserting.
If you acquire a write lock, then you can be certain the table isn't changed while you do your work of validation and inserting.
`mysql> LOCK TABLES tbl_name WRITE;`
don't forget to
`mysql> UNLOCK TABLES;`
The downside with locking is obvious: the table is locked. If it is high traffic, then all your traffic is waiting, and that could lead to all kinds of pain (MySQL running out of connections would be one common one).
That said, I would suggest a different path altogether: let MySQL be the only one that generates a unique ID. That is, make sure the database table has an AUTO_INCREMENT unique ID (primary key), and then have new records in the spreadsheet entered without the unique ID given. Then MySQL will ensure that the new records get a unique ID, and you don't have to worry about locking and can validate and insert without fear of a collision.
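A minimal sketch of that layout, with hypothetical table and column names; the AUTO_INCREMENT column is the key MySQL manages for you, while id_number keeps a UNIQUE constraint so the spreadsheet upsert is still possible:
CREATE TABLE employee_report (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
  id_number VARCHAR(32)  NOT NULL,
  dept      VARCHAR(64),
  PRIMARY KEY (id),
  UNIQUE KEY uq_id_number (id_number)
);

-- One round trip per batch of spreadsheet rows: existing id_numbers are
-- updated, new ones are inserted with a MySQL-generated id.
INSERT INTO employee_report (id_number, dept)
VALUES ('E-1001', 'Sales'),
       ('E-1002', 'HR')
ON DUPLICATE KEY UPDATE dept = VALUES(dept);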
In regards to the question about performance with a 500-record, 69-column table, I can only say that if the PHP server and the MySQL server are reasonably sized and the columns aren't large data types, then this amount of data should be readily handled in a fraction of a second. That said, performance can be sabotaged by one bad line of code, so if your code is slow to perform, I would take that as a separate optimisation problem.

Fastest way to copy a large MySQL table?

What's the best way to copy a large MySQL table in terms of speed and memory use?
Option 1. Using PHP, select X rows from old table and insert them into the new table. Proceed to next iteration of select/insert until all entries are copied over.
Option 2. Use MySQL INSERT INTO ... SELECT without row limits.
Option 3. Use MySQL INSERT INTO ... SELECT with a limited number of rows copied over per run.
EDIT: I am not going to use mysqldump. The purpose of my question is to find the best way to write a database conversion program. Some tables have changed, some have not. I need to automate the entire copy over / conversion procedure without worrying about manually dumping any tables. So it would be helpful if you could answer which of the above options is best.
There is a program that was written specifically for this task called mysqldump.
mysqldump is a great tool in terms of simplicity and careful handling of all types of data, but it is not as fast as load data infile
If you're copying on the same database, I like this version of Option 2:
a) CREATE TABLE foo_new LIKE foo;
b) INSERT INTO foo_new SELECT * FROM foo;
I've got lots of tables with hundreds of millions of rows (like 1/2B) AND InnoDB AND several keys AND constraints. They take many many hours to read from a MySQL dump, but only an hour or so by load data infile. It is correct that copying the raw files with the DB offline is even faster. It is also correct that non-ASCII characters, binary data, and NULLs need to be handled carefully in CSV (or tab-delimited files), but fortunately, I've pretty much got numbers and text :-). I might take the time to see how long the above steps a) and b) take, but I think they are slower than the load data infile... which is probably because of transactions.
Of the three options listed above:
I would select the second option if you have a UNIQUE constraint on at least one column, so that duplicate rows are not created if the script has to be run multiple times to achieve its task in the event of server timeouts.
Otherwise your third option would be the way to go, while manually taking into account any server timeouts to determine your INSERT ... SELECT limits.
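A minimal sketch of the third option, assuming the source table has an auto-increment id column (the table names here are made up); each run copies one key range, and you rerun with the next range until the whole table is copied:
-- IGNORE makes reruns of the same range harmless if the target has a unique key.
INSERT IGNORE INTO foo_new
SELECT * FROM foo
WHERE id > 0 AND id <= 50000;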
Use a stored procedure
Option two must be fastest, but it's gonna be a mighty long transaction. You should look into making a stored procedure doing the copy. That way you could offload some of the data parsing/handling from the MySQL engine.
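For example, a minimal sketch of a stored procedure that copies in chunks; the table names, id column and chunk size are assumptions:
DELIMITER //
CREATE PROCEDURE copy_foo_in_chunks()
BEGIN
  DECLARE last_id BIGINT DEFAULT 0;
  DECLARE max_id  BIGINT DEFAULT 0;
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM foo;
  WHILE last_id < max_id DO
    -- each chunk commits on its own instead of one giant transaction
    INSERT INTO foo_new
    SELECT * FROM foo
    WHERE id > last_id AND id <= last_id + 10000;
    SET last_id = last_id + 10000;
  END WHILE;
END //
DELIMITER ;

CALL copy_foo_in_chunks();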
MySQL's load data query is faster than almost anything else, however it requires exporting each table to a CSV file.
Pay particular attention to escape characters and representing NULL values/binary data/etc in the CSV to avoid data loss.
If possible, the fastest way will be to take the database offline and simply copy data files on disk.
Of course, this has some requirements:
you can stop the database while copying.
you are using a storage engine that stores each table in individual files; MyISAM does this.
you have privileged access to the database server (root login or similar)
Ah, I see you have edited your post, then I think this DBA-from-hell approach is not an option... but still, it's fast!
The best way I've found so far is to create dump files (.txt) using SELECT ... INTO OUTFILE, then use LOAD DATA INFILE in MySQL to get the same data into the database.
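A minimal sketch of that round trip; the file path and table names are made up, and note the file is written on the MySQL server host:
SELECT *
INTO OUTFILE '/tmp/old_table.txt'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
FROM old_table;

LOAD DATA INFILE '/tmp/old_table.txt'
INTO TABLE new_table
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';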

Convert Legacy Text Databases to SQL

At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1gb of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100mb.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skips it. Otherwise:
Parses the data file, cleans up any issues (very simple checks only), and spits out a tab-delimited file of the columns I need (some of the columns I just ignore).
Truncates the table and imports the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes, and skip unchanged rows. I.e., in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare that to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and then store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever (a sketch follows after these ideas). This will leave unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
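A minimal sketch of the fingerprint-field variant of the second idea; table and column names are hypothetical, and the importing script is assumed to compute a SHA1 of each raw text row:
-- Load the current fingerprints once per run...
SELECT id, row_hash FROM legacy_sales;

-- ...then, only for rows whose hash is missing or different, upsert:
INSERT INTO legacy_sales (id, customer, amount, row_hash)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
  customer = VALUES(customer),
  amount   = VALUES(amount),
  row_hash = VALUES(row_hash);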
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table; you can even build in intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another MySQL server and replicate/transfer it over.
I agree with alex's tips. If you can, update only modified fields, and mass-update with transactions and multiple inserts grouped. An additional benefit of transactions is faster updates.
If you are concerned about downtime, instead of truncating the table, insert into a new table, then rename it.
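A minimal sketch of that swap, with hypothetical table names; RENAME TABLE is atomic, so readers never see a half-loaded table:
CREATE TABLE legacy_sales_new LIKE legacy_sales;

LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales_new;

RENAME TABLE legacy_sales     TO legacy_sales_old,
             legacy_sales_new TO legacy_sales;

DROP TABLE legacy_sales_old;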
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips such as:
_ delayed inserts in MySQL improve performance
_ caches can be optimized
_ even if you do not have unique rows, you may (or may not) be able to MD5 the rows

MySQL, best way to insert 1000 rows

I have an "item" table with the following structure:
item
id
name
etc.
Users can put items from this item table into their inventory. I store it in the inventory table like this:
inventory
id
item_id
user_id
Is it OK to insert 1000 rows into inventory table? What is the best way to insert 1000 rows?
MySQL can handle millions of records in a single table without any tweaks. With little tweaks it can handle hundreds of millions (I did that). So I wouldn't worry about that.
To improve insert performance you should use batch inserts.
INSERT INTO my_table (col1, col2) VALUES (val1_1, val2_1), (val1_2, val2_2);
Storing records to a file and using load data infile yields even better results (best in my case), but it requires more effort.
It's okay to insert 1000 rows. You should do this as a transaction so the indices are updated all at once at the commit.
You can also construct a single INSERT statement to insert many rows at a time. (See the syntax for INSERT.) However, I wonder how advisable it would be to do that for 1,000 rows.
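A minimal sketch combining the two points, using the inventory table above (the values are made up):
START TRANSACTION;

INSERT INTO inventory (item_id, user_id)
VALUES (101, 7), (102, 7), (103, 7);  -- many rows per VALUES list

INSERT INTO inventory (item_id, user_id)
VALUES (104, 7), (105, 7);

COMMIT;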
Most efficient would probably be to use LOAD DATA INFILE or LOAD XML
When it gets to the 1000s, I usually write to a pipe-delimited CSV file and use LOAD DATA INFILE to suck it in quickly. By writing to disk, you avoid issues with overflowing your string buffer, if the language you are using has limits on string size. LOAD DATA INFILE is optimized for bulk uploads.
I've done this with up to 1 billion rows (on a cheap $400 4GB 3 year old 32-bit Ubuntu box), so one thousand is not an issue.
Added note: if you don't care about the id assigned and you just want a new unique ID for every record you insert, you could consider setting up AUTO_INCREMENT on id in the table and let MySQL assign an ID for you.
It also depends on how many users you have; if you have 1,000,000 users all doing 1,000 inserts every few minutes, then the server is going to struggle to keep up. From a MySQL point of view, it is certainly capable of handling that much data.

Merging auto-increment table data

I have multiple end-user MySQL DBs with a fairly large amount of data that must be synchronized with a database (also MySQL) populated by an external data feed. End users can add data to their "local" DB, but not to the feed.
The question is how to merge/synchronize the two databases including the foreign keys between the tables of the DBs, without either overwriting the "local" additions or changing the key of the local additions.
Things I've considered include: using a CSV dump of the feed DB and doing a LOAD DATA INFILE with IGNORE, then comparing the files to see which rows from the feed didn't get written and writing them manually; and writing a script to go line by line through the feed DB and create new rows in the local DBs, creating new keys at the same time. However, this seems like it could be very slow, particularly with multiple DBs.
Any thoughts on this? If there was a way to merge these DBs, preserving the keys with a sort of load infile simplicity and speed, that would be ideal.
Use a compound primary key.
primary key(id, source_id)
Make each db use a different value for source_id. That way you can copy database contents around without having PK clashes.
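A minimal sketch of that layout; the table and column names are hypothetical:
CREATE TABLE items (
  id        INT UNSIGNED     NOT NULL,
  source_id TINYINT UNSIGNED NOT NULL,  -- e.g. 0 = feed, 1..n = each local DB
  name      VARCHAR(64),
  PRIMARY KEY (id, source_id)
);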
One option would be to use GUIDs rather than integer keys, but it may not be practical to make such a significant change.
Assuming that you're just updating the user databases from the central "feed" database, I'd use CSV and LOAD DATA INFILE, but load into a staging table within the target database. You could then replace the keys with new values, and finally insert the rows into the permanent tables.
If you're not dealing with huge data volumes, it could be as simple as finding the difference between the highest ID of the existing data and the lowest ID of the incoming data. Add this amount to all of the keys in your incoming data, and there should be no collisions. This would waste some PK values, but that's probably not worth worrying about unless your record count is in the millions. This assumes that your PKs are integers and sequential.
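A minimal sketch of that offset trick for one local DB; the schema-qualified names (feeddb.items, localdb.items) are hypothetical, and any child tables' foreign key columns would need the same offset applied:
SET @offset = (SELECT MAX(id) FROM localdb.items)
            - (SELECT MIN(id) FROM feeddb.items) + 1;

INSERT INTO localdb.items (id, name)
SELECT id + @offset, name
FROM feeddb.items;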