I am pulling a large amount of data (700k lines) from one database (Source Database), performing some data manipulation, then inserting/updating another database (Lookup Database) with the altered data.
The lookup database is being accessed by users regularly throughout the day. They are only performing select statements. The updates to the lookup database occur hourly. As the size of the data in the source database increases, I'm noticing much higher numbers of "lock wait timeout exceeded" errors when updating the lookup database. Is my assumption correct that this is most likely caused by the select statements and the update statements touching the same data? That would make sense, as the larger number of updates would more frequently hit data that is being read by the users.
In my attempts to fix the situation I've increased the lock wait timeout to 120 (up from 60) but that has had very little effect.
One thought I had was to update into a new table in the lookup database, then swap (in the software the users are using) the database their queries go to. Any other ideas how I might resolve the issue?
I'm not sure if it's relevant but I'm using MySQL and the two databases are on completely separate servers.
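A simpler variant of that swap, sketched here with invented table names, would be to load each hourly batch into a shadow table and swap it in with a single atomic rename (RENAME TABLE is atomic in MySQL), so readers never see a half-loaded table:

-- load the hourly batch into lookup_data_new here, then swap atomically
RENAME TABLE lookup_data     TO lookup_data_old,
             lookup_data_new TO lookup_data;
DROP TABLE lookup_data_old;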
Thanks!
I have a MySQL database hosted on a web server which has a set of tables with data in it. I am distributing my front-end application, which is built using HTML5 / JavaScript / CSS3.
Now, when multiple users try to make an insert/update into one of the tables at the same time, is it going to create a conflict, or will MySQL handle the locking for me automatically? For example, when one user is writing, will it lock the table for him and make the rest wait in a queue, and once that user finishes, release the lock and give it to the next in the queue? Is this what happens, or do I need to handle this case myself in the MySQL database?
EXAMPLE:
When a user wants to make an insert into the database, he calls a PHP file located on a web server which runs an INSERT command to post the data into the database. I am concerned whether the insert will still be applied correctly if two or more people insert at the same time.
mysqli_query($con,
    "INSERT INTO cfv_postbusupdate (BusNumber, Direction, StopNames, Status, comments, username, dayofweek, time) " .
    "VALUES (" . trim($busnum) . ", '" . trim($direction3) . "', '" . trim($stopname3) . "', '" . $status . "', " .
    "'" . $comments . "', '" . $username . "', '" . trim($dayofweek3) . "', '" . trim($btime3) . "')");
MySQL handles table locking automatically.
Note that with MyISAM engine, the entire table gets locked, and statements will block ("queue up") waiting for a lock to be released.
The InnoDB engine provides more concurrency, and can do row level locking, rather than locking the entire table.
There may be some cases where you want to take locks on multiple MyISAM tables, if you want to maintain referential integrity, for example, and you want to disallow other sessions from making changes to any of the tables while your session does its work. But, this really kills concurrency; this should be more of an "admin" type function, not really something a concurrent application should be doing.
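As a rough sketch of that kind of explicit multi-table lock (the table names here are invented):

LOCK TABLES orders WRITE, order_items WRITE;
-- make the related changes to both tables while other sessions wait
UPDATE orders SET status = 'shipped' WHERE order_id = 42;
DELETE FROM order_items WHERE order_id = 42 AND quantity = 0;
UNLOCK TABLES;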
If you are making use of transactions (InnoDB), the issue your application needs to deal with is the sequence in which rows in which tables are locked. It's possible for an application to experience "deadlock" exceptions when MySQL detects that there are two (or more) transactions that can't proceed because each needs to obtain locks held by the other. All MySQL can do is detect that condition, and its only recovery is to choose one of the transactions as the victim: that transaction gets the "deadlock" exception because MySQL killed it, so that at least one of the transactions can proceed.
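A hedged illustration of how such a deadlock can arise, using an invented accounts table and two concurrent sessions:

-- Session A:
START TRANSACTION;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- locks row 1
-- Session B:
START TRANSACTION;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;  -- locks row 2
-- Session A:
UPDATE accounts SET balance = balance + 10 WHERE id = 2;  -- waits for B's lock
-- Session B:
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- lock cycle: MySQL rolls one
-- transaction back with a deadlock error so the other one can proceed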
I have two databases that are identical, except that one has about 500,000 entries (distributed over several tables) while the other database is empty.
If I run my program against the empty database, execution takes around 10 minutes, while against the database with the 500k entries it takes around 40 minutes. I then deleted some of the entries (about 250k) and it sped up the execution by around 10 minutes. The strange thing is that these tables were not heavily queried (just some very simple inserts), so I wonder how this can have such an effect on the execution.
Also, all the SQL statements I run (and I run a lot of them) are rather simple (no complicated joins, mainly inserts), so I wonder why some tables with 250k entries can have such an effect on the performance. Any ideas what could be the reason?
The following things could be the reason, but to find the actual cause you should profile your queries.
Though you think you are making simple inserts, an insert is not a simple operation from the DB's perspective. For every row you insert, the following things may have to change and be updated:
Indexes
Constraints
Integrity of the DB (PK-FK), and there is more to consider. The above things look simple, but they take time when the volume is high.
Check the volume of queries. If a high number of INSERT statements are being executed then, as you might know, an insert is an exclusive operation, i.e. it takes locks on the table while writing; a high volume means more locking time and more waiting time.
To avoid this you can try batching or bulk operations (see the sketch after this list):
Is bulk update faster than single update in db2?
Data distribution also plays an important role. If you are accessing heavily loaded tables, then parsing/accessing/fetching data from such tables will also take time (it doesn't matter for a single query, but it really hurts with a large volume of similar queries). Try to minimize that by tuning your queries.
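For example, a sketch of grouping rows into one multi-row INSERT (the table and column names are invented) so the locking and logging work is paid once per statement instead of once per row:

-- instead of three separate single-row INSERTs:
INSERT INTO measurements (device_id, reading, recorded_at)
VALUES
    (1, 20.4, '2013-01-01 10:00:00'),
    (1, 20.7, '2013-01-01 10:01:00'),
    (2, 19.9, '2013-01-01 10:00:00');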
This is regarding SQL Server 2008 R2 and SSIS.
I need to update dozens of history tables on one server with new data from production tables on another server.
The two servers are not, and will not be, linked.
Some of the history tables have hundreds of millions of rows and some of the production tables have dozens of millions of rows.
I currently have a process in place for each table that uses the following data flow components:
OLEDB Source task to pull the appropriate production data.
Lookup task to check if the production data's key already exists in the history table, using the "Redirect to error output" option to catch the rows that do not.
OLEDB Destination task to transfer that missing data into the history table.
The process is too slow for the large tables. There has to be a better way. Can someone help?
I know if the servers were linked a single set based query could accomplish the task easily and efficiently, but the servers are not linked.
Segment your problem into smaller problems. That's the only way you're going to solve this.
Let's examine the problems.
You're inserting and/or updating existing data. At a database level, rows are packed into pages. Rarely is it an exact fit and there's usually some amount of free space left in a page. When you update a row, pretend the Name field went from "bob" to "Robert Michael Stuckenschneider III". That row needs more room to live and while there's some room left on the page, there's not enough. Other rows might get shuffled down to the next page just to give this one some elbow room. That's going to cause lots of disk activity. Yes, it's inevitable given that you are adding more data, but it's important to understand how your data is going to grow and ensure your database itself is ready for that growth.
Maybe you have some non-clustered indexes on a target table. Disabling/dropping them should improve insert/update performance. If you still have your database and log set to grow at 10% or 1MB or whatever the default values are, the storage engine is going to spend all of its time trying to grow files and won't have time to actually write data.
Take away: ensure your system is poised to receive lots of data. Work with your DBA, LAN and SAN team(s).
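As a hedged sketch of those two points in T-SQL (the index, table, file, and database names below are all invented):

-- disable non-clustered indexes before the load, rebuild them afterwards
ALTER INDEX IX_HistoryTable_SomeColumn ON dbo.HistoryTable DISABLE;
-- ... run the load ...
ALTER INDEX IX_HistoryTable_SomeColumn ON dbo.HistoryTable REBUILD;

-- grow files in big, predictable chunks instead of the 10%/1MB defaults
ALTER DATABASE ArchiveDb MODIFY FILE (NAME = ArchiveDb_data, FILEGROWTH = 4GB);
ALTER DATABASE ArchiveDb MODIFY FILE (NAME = ArchiveDb_log,  FILEGROWTH = 1GB);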
You have tens of millions of rows in your OLTP system and hundreds of millions in your archive system. Starting with the OLTP data, you need to identify what does not exist in your historical system. Given your data volumes, I would plan for this package to have a hiccup in processing and to need to be "restartable."
I would have a package that has a data flow with only the business keys selected from the OLTP that are used to make a match against the target table. Write those keys into a table that lives on the OLTP server (ToBeTransfered). Have a second package that uses a subset of those keys (N rows) joined back to the original table as the Source. It's wired directly to the Destination so no lookup is required. That fat data row flows over the network only one time. Then have an Execute SQL Task go in and delete the batch you just sent to the Archive server.
This batching method can allow you to run the second package on multiple servers. The SSIS team describes it better in their paper: We loaded 1TB in 30 minutes
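A rough sketch of the moving parts of that second package, with all object and column names invented (BatchId is a hypothetical batching column):

-- OLE DB Source: one batch of full ("fat") rows, driven by the staged keys
SELECT p.*
FROM dbo.ProductionTable AS p
JOIN dbo.ToBeTransfered  AS k
    ON k.BusinessKey = p.BusinessKey
WHERE k.BatchId = ?;        -- batch number supplied by a package variable

-- Execute SQL Task, after the data flow has landed the rows on the Archive server
DELETE FROM dbo.ToBeTransfered
WHERE BatchId = ?;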
Ensure the Lookup is a query of the form SELECT key1, key2 FROM MyTable. Better yet, can you provide a filter to the lookup? WHERE ProcessingYear = 2013, as there's no need to waste cache on 2012 if the OLTP only contains 2013 data.
You might need to modify your PacketSize on your Connection Manager and have a network person set up Jumbo frames.
Look at your queries. Are you getting good plans? Are your tables over-indexed? Remember, each index is going to result in an increase in the number of writes performed. If you can dump them and recreate after the processing is completed, you'll think your SAN admins bought you some FusionIO drives. I know I did when I dropped 14 NC indexes from a billion row table that only had 10 total columns.
If you're still having performance issues, establish a theoretical baseline (under ideal conditions that will never occur in the real world, I can push 1GB from A to B in N units of time) and work your way from there to what your actual is. You must have a limiting factor (IO, CPU, Memory or Network). Find the culprit and throw more money at it or restructure the solution until it's no longer the lagging metric.
Step 1. Incremental bulk import of the appropriate production data to the new server.
Ref: Importing Data from a Single Client (or Stream) into a Non-Empty Table
http://msdn.microsoft.com/en-us/library/ms177445(v=sql.105).aspx
Step 2. Use a MERGE statement to identify new/existing records and operate on them.
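For example, a hedged sketch of that MERGE (the history and staging table names and columns are invented):

MERGE INTO dbo.HistoryTable AS h
USING dbo.StagedProduction AS s
    ON h.BusinessKey = s.BusinessKey
WHEN MATCHED THEN
    UPDATE SET h.Col1 = s.Col1, h.Col2 = s.Col2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Col1, Col2)
    VALUES (s.BusinessKey, s.Col1, s.Col2);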
I realize that it will take a significant amount of disk space on the new server, but the process would run faster.
At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1 GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100 MB.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skip it. Otherwise:
Parse the data file, clean up any issues (very simple checks only), and spit out a tab-delimited file of the columns I need (some of the columns I just ignore).
TRUNCATE the table and import the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes, and skip unchanged rows. I.e., in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare that to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and then store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This leaves unchanged rows untouched (and thus still cached) on MySQL; see the sketch after this list.
Perform updates inside a transaction to avoid inconsistencies.
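A minimal sketch of the fingerprint idea with the hash kept in a MySQL column (the id, col_a/col_b, and row_sha1 names are invented; MySQL reports 0 affected rows and skips the write when nothing actually changed):

-- assumes legacy_sales has PRIMARY KEY (id) and a CHAR(40) row_sha1 column
INSERT INTO legacy_sales (id, col_a, col_b, row_sha1)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    col_a    = VALUES(col_a),
    col_b    = VALUES(col_b),
    row_sha1 = VALUES(row_sha1);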
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You can even build in intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another MySQL server and replicate / transfer it over.
I agree with alex's tips. If you can, update only modified fields, and mass-update with transactions and multiple inserts grouped together. An additional benefit of transactions is faster updates.
If you are concerned about downtime, instead of truncating the table, insert into a new table and then rename it.
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips such as:
- delayed inserts in MySQL improve performance
- caches can be optimized
- even if you do not have unique rows, you may (or may not) be able to md5 the rows
I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: one request by itself contains 10,000 to 100,000 data sets of information
(each data set needs to be inserted separately as a row in the database).
I may get multiple requests on this application concurrently, so it's necessary for me to make the inserts fast.
I am using a MySQL database. Which approach is better for me, LOAD DATA or batch INSERT, or is there a better way than these two?
How will your application retrieve this information?
- There will be another background-thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting it to database straightaway, but yes if this approach is not feasible enough we may think of queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Those are my answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will enjoy the best success by using MyISAM tables and LOAD DATA INFILE data loading, from source files in the same file system that's used by your MySQL server.
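A minimal sketch of that combination, with invented table, column, and file names:

CREATE TABLE request_queue (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    payload_a VARCHAR(100) NOT NULL,
    payload_b VARCHAR(100) NOT NULL
) ENGINE=MyISAM;

-- the file must live on a file system the MySQL server itself can read
LOAD DATA INFILE '/var/lib/mysql-files/request_batch_0001.tsv'
INTO TABLE request_queue
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(payload_a, payload_b);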
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (this is called "latency").
Given the volume of data you're mentioning here, if you're going into production with living data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.