We have two databases, DB-A and DB-B. DB-A has more than 5,000 tables, and we process the whole database daily. Processing means: we read data from multiple tables of DB-A and insert it into some tables of DB-B. After the insert we access that data many times, because we need to process all of DB-B's data. We access DB-B whenever we need to process it, more than 500 times a day, and each time we read only the data we need for that step. Because we access DB-B so many times, the processing takes more than 2 hours.
The problem is that I want to read the data from DB-A, process it, and insert it into DB-B in one shot. The constraint is limited resources: we have only 16 GB of RAM and are not in a position to add more.
We have already added indexes, but it still takes more than 2 hours. How can I reduce the processing time of this data?
Related
I have a rather large (10M+ rows) table with quite a lot of data coming from IoT devices.
People only access the last 7 days of data, but have on-demand access to older data.
As this table is growing at a very fast pace (100k rows/day), I chose to split it into two tables: one holding only the last 7 days of data, and another with the older data.
I have a cron job running that takes the oldest data and moves it to the other table.
How could I tell MySQL to keep only the '7days' table loaded in memory to speed up read access, and keep the 'archives' table on disk (SSD) for less frequent access?
Is this something implemented in MySQL (Aurora)?
I couldn't find anything in the docs besides in-memory tables, but those are not what I'm after.
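The cron job described above reduces to an insert-select plus a delete, run in one transaction so readers never see a row in both tables or in neither. A minimal sketch using Python's sqlite3 as a stand-in for MySQL (table and column names are made up):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings_7days   (id INTEGER PRIMARY KEY, ts TEXT, value REAL);
CREATE TABLE readings_archive (id INTEGER PRIMARY KEY, ts TEXT, value REAL);
""")

# Seed: one recent row, one month-old row (ISO timestamps sort correctly as text)
now = datetime(2024, 1, 10)
conn.execute("INSERT INTO readings_7days VALUES (1, ?, 1.0)", (now.isoformat(),))
conn.execute("INSERT INTO readings_7days VALUES (2, ?, 2.0)",
             ((now - timedelta(days=30)).isoformat(),))

cutoff = (now - timedelta(days=7)).isoformat()
with conn:  # one transaction: copy the old rows, then delete them
    conn.execute("INSERT INTO readings_archive "
                 "SELECT * FROM readings_7days WHERE ts < ?", (cutoff,))
    conn.execute("DELETE FROM readings_7days WHERE ts < ?", (cutoff,))

print(conn.execute("SELECT COUNT(*) FROM readings_7days").fetchone()[0])    # 1
print(conn.execute("SELECT COUNT(*) FROM readings_archive").fetchone()[0])  # 1
```

On MySQL the same two statements would run inside a single `START TRANSACTION ... COMMIT`.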
I have an ETL process that takes data from a transaction DB and, after processing, stores it in another DB. While storing, we truncate the old data and insert the new data, because truncate-and-insert performs much better than updates. As a result, for a short window (2-3 minutes) queries see counts of 0 or wrong data. We run the ETL every 8 hours.
How can we avoid this problem? How can we achieve zero downtime?
One approach we used in the past was to prepare the production data in a table named temp. When finished (and checked, which was the lengthy part of our process), we dropped prod and renamed temp to prod.
The swap takes almost no time, and the process succeeded even when other users were locking the table.
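The swap can be sketched with Python's sqlite3 standing in for the production DBMS (on MySQL the rename step would be an atomic RENAME TABLE; table names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod (id INTEGER, val TEXT)")
conn.execute("INSERT INTO prod VALUES (1, 'stale')")

# Build and verify the new data set off to the side, in 'temp';
# readers keep hitting 'prod' the whole time
conn.execute("CREATE TABLE temp (id INTEGER, val TEXT)")
conn.execute("INSERT INTO temp VALUES (1, 'fresh')")
assert conn.execute("SELECT COUNT(*) FROM temp").fetchone()[0] > 0  # the "checked" step

# The swap itself: a metadata-only operation, near-instant regardless of table size
with conn:
    conn.execute("DROP TABLE prod")
    conn.execute("ALTER TABLE temp RENAME TO prod")

print(conn.execute("SELECT val FROM prod").fetchone()[0])  # fresh
```

On MySQL, `RENAME TABLE prod TO old, temp TO prod` swaps both names in one atomic statement, which avoids even the brief moment with no `prod` table at all.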
This is regarding SQL Server 2008 R2 and SSIS.
I need to update dozens of history tables on one server with new data from production tables on another server.
The two servers are not, and will not be, linked.
Some of the history tables have hundreds of millions of rows, and some of the production tables have tens of millions of rows.
I currently have a process in place for each table that uses the following data flow components:
OLEDB Source task to pull the appropriate production data.
Lookup task to check whether each production row's key already exists in the history table, using "Redirect to error output" so that
only the missing rows are transferred to the OLEDB Destination history table.
The process is too slow for the large tables. There has to be a better way. Can someone help?
I know if the servers were linked a single set based query could accomplish the task easily and efficiently, but the servers are not linked.
Segment your problem into smaller problems. That's the only way you're going to solve this.
Let's examine the problems.
You're inserting and/or updating existing data. At a database level, rows are packed into pages. Rarely is it an exact fit, and there's usually some amount of free space left in a page. When you update a row, pretend the Name field went from "bob" to "Robert Michael Stuckenschneider III". That row needs more room to live, and while there's some room left on the page, there's not enough. Other rows might get shuffled down to the next page just to give this one some elbow room. That's going to cause lots of disk activity. Yes, it's inevitable given that you are adding more data, but it's important to understand how your data is going to grow and ensure your database itself is ready for that growth.

Maybe you have some non-clustered indexes on a target table. Disabling/dropping them should improve insert/update performance. If you still have your database and log set to grow at 10% or 1MB or whatever the default values are, the storage engine is going to spend all of its time trying to grow files and won't have time to actually write data.

Take away: ensure your system is poised to receive lots of data. Work with your DBA, LAN and SAN team(s).
You have tens of millions of rows in your OLTP system and hundreds of millions in your archive system. Starting with the OLTP data, you need to identify what does not exist in your historical system. Given your data volumes, I would plan for this package to hiccup during processing, so it needs to be "restartable."

I would have a package with a data flow that selects only the business keys from the OLTP that are used to make a match against the target table. Write those keys into a table that lives on the OLTP server (ToBeTransfered). Have a second package that uses a subset of those keys (N rows), joined back to the original table, as the Source. It's wired directly to the Destination, so no lookup is required; the fat data row flows over the network only one time. Then have an Execute SQL Task delete the batch you just sent to the archive server. This batching method allows you to run the second package on multiple servers. The SSIS team describes it better in their paper: We loaded 1TB in 30 minutes.
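That two-package scheme can be sketched in miniature with Python's sqlite3 (two in-memory tables stand in for the OLTP and archive servers; ToBeTransfered is the staging table named above, everything else is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE oltp           (bk INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE archive        (bk INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE ToBeTransfered (bk INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO oltp VALUES (?, ?)", [(i, f"row{i}") for i in range(1, 11)])
conn.executemany("INSERT INTO archive VALUES (?, ?)", [(i, f"row{i}") for i in range(1, 6)])

# "Package 1": record which business keys are missing from the archive
conn.execute("""INSERT INTO ToBeTransfered
                SELECT bk FROM oltp WHERE bk NOT IN (SELECT bk FROM archive)""")

# "Package 2": ship N keys at a time, deleting each batch once sent,
# so a crash mid-run only re-sends the current batch (restartable)
N = 2
while True:
    batch = [r[0] for r in conn.execute("SELECT bk FROM ToBeTransfered LIMIT ?", (N,))]
    if not batch:
        break
    ph = ",".join("?" * len(batch))
    with conn:  # one transaction per batch
        conn.execute(f"INSERT INTO archive SELECT bk, payload "
                     f"FROM oltp WHERE bk IN ({ph})", batch)
        conn.execute(f"DELETE FROM ToBeTransfered WHERE bk IN ({ph})", batch)

print(conn.execute("SELECT COUNT(*) FROM archive").fetchone()[0])  # 10
```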
Ensure the Lookup uses a query of the form SELECT key1, key2 FROM MyTable. Better yet, can you provide a filter to the lookup, such as WHERE ProcessingYear = 2013? There's no need to waste cache on 2012 if the OLTP only contains 2013 data.
You might need to modify your PacketSize on your Connection Manager and have a network person set up Jumbo frames.
Look at your queries. Are you getting good plans? Are your tables over-indexed? Remember, each index is going to result in an increase in the number of writes performed. If you can dump them and recreate after the processing is completed, you'll think your SAN admins bought you some FusionIO drives. I know I did when I dropped 14 NC indexes from a billion row table that only had 10 total columns.
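A toy version of that drop-load-recreate pattern, with sqlite3 standing in for SQL Server (index and table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (k INTEGER, a TEXT, b TEXT)")
conn.execute("CREATE INDEX ix_history_a ON history (a)")
conn.execute("CREATE INDEX ix_history_b ON history (b)")

# Drop the non-clustered indexes before the bulk load, so each insert
# writes only to the base table instead of three structures
conn.execute("DROP INDEX ix_history_a")
conn.execute("DROP INDEX ix_history_b")

conn.executemany("INSERT INTO history VALUES (?, ?, ?)",
                 [(i, f"a{i}", f"b{i}") for i in range(10000)])

# Recreate them once, in a single pass, after all rows are in place
conn.execute("CREATE INDEX ix_history_a ON history (a)")
conn.execute("CREATE INDEX ix_history_b ON history (b)")

print(conn.execute("SELECT COUNT(*) FROM history").fetchone()[0])  # 10000
```

One sorted index build per index is far cheaper than maintaining every index row-by-row during the load.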
If you're still having performance issues, establish a theoretical baseline (under ideal conditions that will never occur in the real world, I can push 1GB from A to B in N units of time) and work your way from there to what your actual is. You must have a limiting factor (IO, CPU, Memory or Network). Find the culprit and throw more money at it or restructure the solution until it's no longer the lagging metric.
Step 1. Incremental bulk import of the appropriate production data to the new server.
Ref: Importing Data from a Single Client (or Stream) into a Non-Empty Table
http://msdn.microsoft.com/en-us/library/ms177445(v=sql.105).aspx
Step 2. Use a MERGE statement to identify new/existing records and operate on them.
I realize this will take a significant amount of disk space on the new server, but the process will run faster.
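Step 2 can be sketched with SQLite's upsert syntax (INSERT ... ON CONFLICT, available in SQLite 3.24+) standing in for the T-SQL MERGE statement; table names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (k INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE TABLE staging (k INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO history VALUES (?, ?)", [(1, "old"), (2, "old")])
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(2, "new"), (3, "new")])

# Merge staging into history: new keys are inserted, existing keys updated
# (T-SQL MERGE would express this as WHEN MATCHED THEN UPDATE /
#  WHEN NOT MATCHED THEN INSERT)
conn.execute("""
    INSERT INTO history (k, val)
    SELECT k, val FROM staging WHERE true
    ON CONFLICT(k) DO UPDATE SET val = excluded.val
""")

rows = conn.execute("SELECT k, val FROM history ORDER BY k").fetchall()
print(rows)  # [(1, 'old'), (2, 'new'), (3, 'new')]
```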
I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: one request itself contains 10,000 to 100,000 data sets of information
(each data set needs to be inserted separately as a row in the database).
I may get multiple requests to this application concurrently, so it's necessary for me to make the inserts fast.
I am using a MySQL database. Which approach is better for me: LOAD DATA or batch INSERT? Or is there a better way than these two?
How will your application retrieve this information?
- There will be another background, thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting into the database straightaway, but yes, if this approach is not feasible enough we may consider queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Here are answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will have the best success using MyISAM tables and LOAD DATA INFILE, loading from source files on the same file system your MySQL server uses.
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (This is called "latency.")
Given the volume of data you're mentioning here, if you're going into production with living data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.
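The suggestion above names Java and ConcurrentLinkedQueue; the same in-RAM queue-plus-worker shape, sketched here in Python with queue.Queue (the "processing" step is a stand-in):

```python
import queue
import threading

q = queue.Queue()   # thread-safe FIFO, analogous to ConcurrentLinkedQueue
processed = []

def worker():
    while True:
        record = q.get()
        if record is None:              # sentinel: no more records
            break
        processed.append(record * 2)    # stand-in for real processing
        q.task_done()

t = threading.Thread(target=worker)
t.start()

# A "request" arrives as a batch of records; enqueue them
# instead of INSERTing each one into a DB table
for record in range(5):
    q.put(record)

q.join()       # block until every enqueued record has been processed
q.put(None)    # tell the worker to shut down
t.join()
print(processed)  # [0, 2, 4, 6, 8]
```

The trade-off versus a DB-backed queue is durability: records in RAM are lost on a crash, which is why a broker like ActiveMQ is suggested for production data.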
We have a data acquisition application with two primary modules interfacing with the DB (via Hibernate): one for writing the collected data into the DB, and one for reading/presenting the collected data from the DB.
Average rate of inserts is 150-200 per second, average rate of selects is 50-80 per second.
Performance requirements for both writing/reading scenarios can be defined like this:
Writing into DB - no specific timings or performance requirements here, DB should be operating normally with 150-200 inserts per second
Reading from DB - newly collected data should be available to the user within 3-5 seconds timeframe after getting into DB
Please advise on the best approach to tuning Hibernate's caching/buffering/operating policies to optimally support this scenario.
BTW, MySQL with the InnoDB engine is being used underneath Hibernate.
Thanks.
P.S.: By "150-200 inserts per second" I mean the average rate of incoming data packets, not the actual number of records being inserted into the DB. In any case, we should target a very high rate of inserts per second.
I would read this chapter of the Hibernate docs first.
And then consider the following
Inserting
Batch the inserts and do a few hundred per transaction. You say you can tolerate a delay of 3-5 seconds, so this should be fine.
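The underlying shape (many inserts per commit rather than one commit per row) can be sketched with Python's sqlite3; batch size and table are illustrative. In Hibernate itself you would additionally set hibernate.jdbc.batch_size and flush/clear the session at the same interval:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packets (id INTEGER PRIMARY KEY, payload TEXT)")

BATCH = 200          # a few hundred inserts per transaction, as suggested
rows = [(i, f"p{i}") for i in range(1000)]

for start in range(0, len(rows), BATCH):
    with conn:       # one commit (one durable write) per batch, not per row
        conn.executemany("INSERT INTO packets VALUES (?, ?)",
                         rows[start:start + BATCH])

print(conn.execute("SELECT COUNT(*) FROM packets").fetchone()[0])  # 1000
```

With a 200-row batch at 150-200 incoming records per second, each batch commits roughly once per second, well inside the 3-5 second freshness window.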
Selecting
Querying may already be OK at 50-80/second, provided the queries are very simple.
Index your data appropriately for common access patterns
You could try a second-level cache in Hibernate; see this chapter. I haven't done this myself, so I can't comment further.