Convert Legacy Text Databases to SQL - mysql

At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1 GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100 MB.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skip it. Otherwise:
Parse the data file, clean up any issues (very simple checks only), and spit out a tab-delimited file of the columns I need (some of the columns I just ignore).
Truncate the table and import the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
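The parse step described above could be sketched roughly as follows. This is only an illustration: the column layout in FIELDS, the assumption that the 0x7F deleted-record marker sits in the first character of a record, and the function names are all hypothetical, since the real layout depends on the legacy file format.

```python
import csv
import io

# Hypothetical layout: (column name, start offset, end offset) for each kept field.
FIELDS = [("invoice_id", 1, 9), ("customer", 9, 29), ("amount", 29, 39)]
# Assumption: a deleted record carries the 0x7F marker in its first character.
DELETED_MARKER = "\x7f"


def parse_fixed_width(lines, fields=FIELDS):
    """Yield one list of stripped column values per live (non-deleted) record."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(DELETED_MARKER):
            continue  # skip blank lines and records flagged as deleted
        yield [line[start:end].strip() for _, start, end in fields]


def to_tab_delimited(rows):
    """Render rows as tab-delimited text, matching LOAD DATA INFILE's default format."""
    out = io.StringIO()
    writer = csv.writer(
        out,
        delimiter="\t",
        lineterminator="\n",
        quoting=csv.QUOTE_NONE,
        escapechar="\\",
    )
    writer.writerows(rows)
    return out.getvalue()
```

Columns that are not listed in FIELDS are simply ignored, which matches the "some of the columns I just ignore" step.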
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!

Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes, and skip unchanged rows. I.e., in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare that to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and then store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This will leave unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
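The fingerprint idea could look something like this minimal sketch. It assumes each record arrives as a (key, raw_line) pair and uses a plain dict as a stand-in for the cache; the function name and the pair format are hypothetical, and a GDBM file or a fingerprint column would replace the dict in practice.

```python
import hashlib


def changed_rows(rows, cache):
    """Return the rows whose SHA-1 fingerprint differs from the cached one.

    rows  -- iterable of (key, raw_line) pairs; key is the assumed primary-key field
    cache -- dict mapping key -> hex digest from the previous run (updated in place)
    """
    changed = []
    for key, raw_line in rows:
        digest = hashlib.sha1(raw_line.encode("utf-8")).hexdigest()
        if cache.get(key) != digest:
            # New or modified row: remember it and refresh the fingerprint.
            changed.append((key, raw_line))
            cache[key] = digest
    return changed
```

Only the returned rows need an UPDATE or INSERT. Rows that were deleted from the text file are not detected by this loop; keys that are in the cache but were not seen in the file would have to be handled separately.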

Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You can even build in intelligence by checking for duplicate records rather than truncating the entire table.
Offload the processing to another MySQL server and replicate / transfer it over.

I agree with alex's tips. If you can, update only the modified fields, and do mass updates inside transactions with multiple inserts grouped together. An additional benefit of transactions is faster updates.
If you are concerned about downtime, insert into a new table instead of truncating the existing one, then rename it.
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips, such as:
- delayed inserts (INSERT DELAYED) in MySQL can improve performance
- caches can be optimized
- even if you do not have unique rows, you may (or may not) be able to md5 the rows
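The "insert into a new table, then rename it" tip could be sketched as below. The helper only builds the SQL statements; the `_new` and `_old` suffixes and the function name are hypothetical. It relies on MySQL's RENAME TABLE performing several renames in one atomic statement, so readers see either the old or the new table.

```python
def swap_statements(table, data_file):
    """Build the SQL for loading into a shadow table and swapping it in."""
    new, old = f"{table}_new", f"{table}_old"
    return [
        # Clear leftovers from a previously interrupted run.
        f"DROP TABLE IF EXISTS {new}, {old}",
        # Same columns and indexes as the live table.
        f"CREATE TABLE {new} LIKE {table}",
        f"LOAD DATA INFILE '{data_file}' INTO TABLE {new}",
        # Both renames happen atomically in a single statement.
        f"RENAME TABLE {table} TO {old}, {new} TO {table}",
        f"DROP TABLE {old}",
    ]
```

For example, `swap_statements("legacy_sales", "/tmp/filesale.data")` yields five statements to run in order with whatever MySQL client the cron script already uses.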

Related

Database query efficiency

My boss is having me create a database table that keeps track of some of our inventory with various parameters. It's meant to be implemented as a cron job that runs every half hour or so, but the scheduling part isn't important since we've already discussed that we're handling it later.
What I want to know is whether it's more efficient to just delete everything in the table each time the script is called and repopulate it, or to go through each record to determine if any changes were made and update each entry accordingly. It's easier to do the former, but given that we have over 700 separate records to keep track of, I don't know if the time it takes to do this would put a huge load on the server. The script is written in PHP.
700 records is an extremely small number of records to have performance concerns. Don't even think about it, do whichever is easier for you.
But if it is performance that you are after, updating rows is slower than inserting rows (especially if you are not expecting any generated keys, so an insertion is a one-way operation to the database instead of a round trip to and from the database), and TRUNCATE TABLE tends to be faster than DELETE FROM.
If your inventory rows have proper IDs in the SQL database, it would be good practice to update the existing rows, since with repeated delete-and-reinsert your IDs could in theory get exhausted (overflow).
Another approach would be to use a NoSQL DB like MongoDB and simply update the DB with the given JSON bodies, keyed by the existing IDs, and let the DB itself figure out what changed.

SSIS to insert non-matching data on non-linked server

This is regarding SQL Server 2008 R2 and SSIS.
I need to update dozens of history tables on one server with new data from production tables on another server.
The two servers are not, and will not be, linked.
Some of the history tables have hundreds of millions of rows and some of the production tables have tens of millions of rows.
I currently have a process in place for each table that uses the following data flow components:
OLEDB Source task to pull the appropriate production data.
Lookup task to check whether the production data's key already exists in the history table, using "Redirect to error output" for rows that have no match.
Transfer the missing data to the OLEDB Destination history table.
The process is too slow for the large tables. There has to be a better way. Can someone help?
I know if the servers were linked a single set based query could accomplish the task easily and efficiently, but the servers are not linked.
Segment your problem into smaller problems. That's the only way you're going to solve this.
Let's examine the problems.
You're inserting and/or updating existing data. At a database level, rows are packed into pages. Rarely is it an exact fit and there's usually some amount of free space left in a page. When you update a row, pretend the Name field went from "bob" to "Robert Michael Stuckenschneider III". That row needs more room to live and while there's some room left on the page, there's not enough. Other rows might get shuffled down to the next page just to give this one some elbow room. That's going to cause lots of disk activity. Yes, it's inevitable given that you are adding more data but it's important to understand how your data is going to grow and ensure your database itself is ready for that growth. Maybe you have some non-clustered indexes on a target table. Disabling/dropping them should improve insert/update performance. If you still have your database and log set to grow at 10% or 1MB or whatever the default values are, the storage engine is going to spend all of its time trying to grow files and won't have time to actually write data. Take away: ensure your system is poised to receive lots of data. Work with your DBA, LAN and SAN team(s).
You have tens of millions of rows in your OLTP system and hundreds of millions in your archive system. Starting with the OLTP data, you need to identify what does not exist in your historical system. Given your data volumes, I would plan for this package to have a hiccup in processing, so it needs to be "restartable." I would have a package whose data flow selects only the business keys from the OLTP system that are used to make a match against the target table. Write the keys that are missing from the target into a table that lives on the OLTP server (ToBeTransfered). Have a second package that uses a subset of those keys (N rows) joined back to the original table as the Source. It's wired directly to the Destination, so no lookup is required. That fat data row flows over the network only one time. Then have an Execute SQL Task go in and delete the batch you just sent to the Archive server. This batching method can allow you to run the second package on multiple servers. The SSIS team describes it better in their paper "We loaded 1TB in 30 minutes".
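The key-matching and batching logic from the paragraph above can be illustrated outside SSIS with a small sketch. This is plain Python rather than an SSIS package, the function names are hypothetical, and it assumes the business keys fit in memory, which may not hold at the volumes described.

```python
from itertools import islice


def missing_keys(source_keys, target_keys):
    """Return the source keys that do not yet exist in the target, in source order."""
    target = set(target_keys)
    return [key for key in source_keys if key not in target]


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Each batch plays the role of the "subset of those keys (N rows)": it is joined back to the source table, sent to the destination, and then removed from the to-be-transferred list, so an interrupted run can resume with the remaining keys.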
Ensure the Lookup is a query of the form SELECT key1, key2 FROM MyTable. Better yet, can you provide a filter to the lookup, such as WHERE ProcessingYear = 2013? There's no need to waste cache on 2012 if the OLTP only contains 2013 data.
You might need to modify your PacketSize on your Connection Manager and have a network person set up Jumbo frames.
Look at your queries. Are you getting good plans? Are your tables over-indexed? Remember, each index is going to result in an increase in the number of writes performed. If you can dump them and recreate after the processing is completed, you'll think your SAN admins bought you some FusionIO drives. I know I did when I dropped 14 NC indexes from a billion row table that only had 10 total columns.
If you're still having performance issues, establish a theoretical baseline (under ideal conditions that will never occur in the real world, I can push 1GB from A to B in N units of time) and work your way from there to what your actual is. You must have a limiting factor (IO, CPU, Memory or Network). Find the culprit and throw more money at it or restructure the solution until it's no longer the lagging metric.
Step 1. Incremental bulk import of appropriate production data to new server.
Ref: Importing Data from a Single Client (or Stream) into a Non-Empty Table
http://msdn.microsoft.com/en-us/library/ms177445(v=sql.105).aspx
Step 2. Use Merge Statement to identify new/existing records and operate on them.
I realize that it will take a significant amount of disk space on the new server, but the process would run faster.

MySql, LOAD DATA or BATCH INSERT or any other better way for bulk inserts

I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: one request contains 10,000 to 100,000 data sets of information
(each data set needs to be inserted separately as a row in the database).
I may get multiple requests on this application concurrently, so it's necessary for me to make the inserts fast.
I am using a MySQL database. Which approach is better for me, LOAD DATA or BATCH INSERT, or is there a better way than these two?
How will your application retrieve this information?
- There will be another background thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting it to database straightaway, but yes if this approach is not feasible enough we may think of queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Those are the answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will enjoy the best success by using MyISAM tables and LOAD DATA INFILE data loading, from source files in the same file system that's used by your MySQL server.
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (this is called "latency").
Given the volume of data you're mentioning here, if you're going into production with living data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.
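The in-memory queue approach can be sketched as follows. The answer suggests Java with a ConcurrentLinkedQueue; this is a Python analogue using the standard library's `queue.Queue` and threads, kept in Python only for illustration. The function name, the `handle` callback, and the worker count are hypothetical.

```python
import queue
import threading


def process_all(batches, handle, workers=4):
    """Push every record of every batch through a queue drained by worker threads."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            record = q.get()
            if record is None:  # sentinel: no more work for this worker
                break
            out = handle(record)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for batch in batches:
        for record in batch:
            q.put(record)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Results come back in completion order, not arrival order, so this only fits if the records are independent of each other, which is one of the questions raised above.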

Uploading enormous amount of data to MySQL server

I have to upload about 16 million records to a MySQL 5.1 server on a shared webspace which does not permit LOAD DATA functionality. The table is an InnoDB table. I have not assigned any keys yet.
Therefore, I use a Python script to convert my CSV file (of 2.5 GB of size) to an SQL file with individual INSERT statements. I've launched the SQL file, and the process is incredibly slow, it feels like 1000-1500 lines are processed every minute!
In the meantime, I read about bulk inserts, but did not find any reliable source telling how many records one insert statement can have. Do you know?
Is it an advantage to have no keys and add them later?
Would a transaction around all the insert help speed up the process? In fact, there's just a single connection (mine) working with the database at this time.
If you use the insert ... values ... syntax to insert multiple rows in a single request, your query size is limited by the max_allowed_packet value rather than by the number of rows.
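Because the limit is on statement size rather than row count, a conversion script can pack rows into multi-row INSERT statements until a byte budget is reached. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the values are assumed to be already-escaped SQL literals, and the default `max_bytes` is a placeholder that should stay below the server's actual max_allowed_packet.

```python
def multi_row_inserts(table, columns, rows, max_bytes=1_000_000):
    """Yield INSERT statements holding as many rows as fit under max_bytes each.

    rows -- iterable of tuples of already-escaped SQL literals (assumption)
    """
    prefix = f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
    prefix_size = len(prefix.encode("utf-8"))
    chunk, size = [], prefix_size
    for row in rows:
        values = "(" + ", ".join(row) + ")"
        values_size = len(values.encode("utf-8")) + 2  # +2 for the ", " separator
        if chunk and size + values_size > max_bytes:
            yield prefix + ", ".join(chunk)
            chunk, size = [], prefix_size
        chunk.append(values)
        size += values_size
    if chunk:
        yield prefix + ", ".join(chunk)
```

A single row larger than the budget is still emitted on its own, so the budget is a target rather than a hard guarantee.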
Concerning keys: it's a good practice to define keys before any data manipulations. Actually, when you build a model you must think of keys, relations, indexes etc.
It's better to define indexes before you insert data as well. CREATE INDEX works quite slowly on huge datasets. But postponing index creation is not a huge disadvantage.
To make your inserts faster, turn autocommit mode off (so that many inserts are committed together rather than one at a time) and do not run concurrent requests on your tables.

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a MySQL server to store some data but realized (after reading a bit this weekend) that I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although it's very small data: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60 to 100k/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal (in addition to the database server). Can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I can load the data onto each of them and then somehow try to merge all the separated data into one server?
I was going to use MySQL with InnoDB (I can use any other settings if it helps), but it's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before but was looking for a MySQL solution first, since it seems more widely used and easier to get help with if I have problems.)
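As a quick check on that estimate, the load time follows directly from the row count and the quoted insert rates. The helper name below is hypothetical; the figures are the ones stated above.

```python
def load_hours(rows, rows_per_second):
    """Hours needed to insert `rows` at a sustained rate of `rows_per_second`."""
    return rows / rows_per_second / 3600


# 5 billion rows at the quoted 60k to 100k inserts per second.
slow = load_hours(5_000_000_000, 60_000)   # about 23.1 hours
fast = load_hours(5_000_000_000, 100_000)  # about 13.9 hours
print(f"{fast:.1f} to {slow:.1f} hours")
```

So even at the upper end of the quoted rates, a single sequential load would take roughly 14 hours.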
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple MySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your MySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using MyISAM rather than InnoDB for this table; MyISAM's lack of transaction semantics makes it faster to load. MyISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the MySQL server to read a file from the MySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the MySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.