I have a question. I'm working on loading a very large table with existing data on the order of 150 million records, which keeps growing by about 1 million records daily. A few days ago the ETL started failing even after running for 24 hours. In the data flow task, a source query pulls 1 million records, which are run through a Lookup against the destination table's 150 million records to check for new records. It fails because the Lookup cannot hold data for 150 million records. I have tried changing the Lookup to a Merge Join without success. Can you please suggest alternative designs to load the data into the large table successfully? There is no way I can reduce the size of the destination table, and I already have indexes on all required columns. I hope the scenario is clear.
You can try table partitioning to import a large amount of data; see MSDN for examples and for more information about creating and maintaining partitioned tables.
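As a minimal sketch of what partitioning the destination might look like, assuming a date column suitable for range partitioning (the function, scheme, table, and column names here are illustrative, not from the question):

```sql
-- Hedged sketch: monthly RANGE RIGHT partitioning on a hypothetical LoadDate column.
CREATE PARTITION FUNCTION pfMonthly (date)
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

CREATE TABLE dbo.BigTable
(
    Id       bigint      NOT NULL,
    LoadDate date        NOT NULL,
    Payload  varchar(50) NULL,
    CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (Id, LoadDate)
) ON psMonthly (LoadDate);
```

With this in place, bulk loads can target (or switch in) a single partition rather than touching the whole 150-million-row table.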
Split the journey into two stages:
Create a new data flow task (or a separate package) to stage your source table "as is" in an environment close to your target database.
Alter your existing data flow task to query across both your staged table and the target table, pulling out only new (and changed?) records and processing them accordingly within SSIS. This eliminates the need for the expensive Lookup to make a round trip to the database per record.
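Once source and target live side by side, the "new records only" check becomes a single set-based query. A minimal sketch, with placeholder table and column names:

```sql
-- Hedged sketch: insert only the staged rows whose business key is not
-- already in the target. All names are illustrative.
INSERT INTO dbo.TargetTable (BusinessKey, Col1, Col2)
SELECT s.BusinessKey, s.Col1, s.Col2
FROM   dbo.StagedSource AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.TargetTable AS t
                   WHERE  t.BusinessKey = s.BusinessKey);
```

The database engine does one indexed anti-join instead of SSIS caching 150 million lookup rows in memory.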
Related
I have a 120 GB database with one very heavy table of 80 GB (storing more than 10 years of data).
I'm thinking of moving the old data to an archive, but wonder which is better:
to move it to a new table in the same database
to move it to a new table in a new archive database
What would be the result from a performance point of view?
1/ If I reduce the table to only 8 GB and move 72 GB into another table in the same database, will the database run faster (we won't access the archive table with read/write operations, and r/w will be done on a lighter table)?
2/ Will keeping 72 GB of data in the archive table slow down the database engine anyway?
3/ Will having the 72 GB of data in another archive database have any benefit versus keeping the 72 GB in the archive table of the master database?
Thanks for your answers,
Edouard
The size of a table may or may not impact the performance of queries against that table. It depends on the query, innodb_buffer_pool_size, and the amount of RAM. Let's see some typical queries.
The existence of a big table that is not being used has no impact on queries against other tables.
It may or may not be wise to PARTITION BY RANGE (TO_DAYS(...)) with monthly or yearly partitions. The main advantage comes when you get around to purging old data, but you don't seem to need that.
If you do split into 72 + 8, I recommend copying the 8 GB out of the 80 GB into a new table, then using RENAME TABLE to juggle the table names.
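A minimal sketch of that copy-and-rename approach, assuming a date column that separates "hot" from archive data (the table names, column name, and cutoff date are all illustrative):

```sql
-- Hedged sketch (MySQL): copy the ~8 GB of recent rows into a new table,
-- then swap names in one atomic RENAME so readers never see a gap.
CREATE TABLE big_table_new LIKE big_table;

INSERT INTO big_table_new
SELECT * FROM big_table
WHERE  created_at >= '2013-01-01';   -- the slice you keep active

RENAME TABLE big_table     TO big_table_archive,
             big_table_new TO big_table;
```

After the swap, `big_table` holds only the light working set and `big_table_archive` keeps the old 72 GB untouched.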
Two TABLEs in one DATABASE are essentially the same as having the TABLEs in different DATABASEs.
I'll update this Answer when you have provided more details.
I have a tabular cube that takes a long time to process. My idea is to process only new data every hour and do a full process during the night. Is there a way to do that with SSIS and a SQL Agent job?
Assuming your "new rows" are inserts to your fact table rather than updates or deletes, you can do a ProcessAdd operation. ProcessAdd takes a SQL query you provide that returns the new rows and adds them to your table in SSAS Tabular.
There are several ways to automate this, all of which could be run from SSIS. This article walks through the options well.
If you have updates and deletes, then you need to partition your table inside SSAS. For example, partition by week, then only reprocess (ProcessData) the partitions where any rows have been inserted, updated, or deleted.
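For models at compatibility level 1200 or later, the per-partition reprocess can be issued as a TMSL script (e.g. from a SQL Agent job or an SSIS Analysis Services Execute DDL Task). This is only a sketch; the database, table, and partition names are hypothetical, and `"dataOnly"` is TMSL's equivalent of ProcessData (ProcessAdd is `"type": "add"`):

```json
{
  "refresh": {
    "type": "dataOnly",
    "objects": [
      {
        "database": "SalesModel",
        "table": "FactSales",
        "partition": "FactSales 2013-W14"
      }
    ]
  }
}
```

Follow the data refresh with a `"calculate"` refresh at the database level so hierarchies and calculated columns are rebuilt.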
Hi, I have around 1,000 million records in a production table and around 1,500 million records in a staging table. I need to compare the staging data with the production data: if there is a new record, insert it; if a record has changed, update the changed columns or insert a new record. What is the best approach for this? Please suggest.
If you want to copy a table from one database to another four times a year, one way is to simply truncate and reload the target table. Don't mess around trying to update individual rows; it's often quicker to delete the lot and reload.
You need to do some analysis and see if there is any field in the source you can use to reliably load a subset of the data. For example, if there is a 'created date' on the record, you can use it to load only the data created in the last three months (instead of all of it).
If you add some more info then we can be more specific about a solution.
e.g.:
- This has to be fast, but we have lots of disk space
- This must be simple to maintain and can take a day to load
Also... I assume the source and target are SQL Server? If so, is it Enterprise edition?
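The 'created date' subset idea above could look like this; the table and column names are placeholders, and the three-month window is just the example from the answer:

```sql
-- Hedged sketch: pull only the recent slice instead of the whole table.
SELECT *
FROM   dbo.SourceTable
WHERE  CreatedDate >= DATEADD(MONTH, -3, CAST(GETDATE() AS date));
```

With an index on the date column, this turns a billion-row scan into a seek over the recent slice.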
This is regarding SQL Server 2008 R2 and SSIS.
I need to update dozens of history tables on one server with new data from production tables on another server.
The two servers are not, and will not be, linked.
Some of the history tables have hundreds of millions of rows, and some of the production tables have dozens of millions of rows.
I currently have a process in place for each table that uses the following data flow components:
An OLE DB Source to pull the appropriate production data.
A Lookup that checks whether each production key already exists in the history table, using "Redirect to error output" to pass along the unmatched rows.
An OLE DB Destination that inserts those missing rows into the history table.
The process is too slow for the large tables. There has to be a better way. Can someone help?
I know if the servers were linked a single set based query could accomplish the task easily and efficiently, but the servers are not linked.
Segment your problem into smaller problems. That's the only way you're going to solve this.
Let's examine the problems.
You're inserting and/or updating existing data. At the database level, rows are packed into pages. Rarely is it an exact fit, and there's usually some amount of free space left in a page. When you update a row, pretend the Name field went from "bob" to "Robert Michael Stuckenschneider III". That row needs more room to live, and while there's some room left on the page, there's not enough. Other rows might get shuffled down to the next page just to give this one some elbow room. That's going to cause lots of disk activity. Yes, it's inevitable given that you are adding more data, but it's important to understand how your data is going to grow and ensure your database itself is ready for that growth.

Maybe you have some non-clustered indexes on the target table; disabling or dropping them should improve insert/update performance. If you still have your database and log set to grow at 10% or 1MB or whatever the default values are, the storage engine is going to spend all of its time growing files and won't have time to actually write data.

Takeaway: ensure your system is poised to receive lots of data. Work with your DBA, LAN, and SAN team(s).
You have tens of millions of rows in your OLTP system and hundreds of millions in your archive system. Starting with the OLTP data, you need to identify what does not exist in your historical system. Given your data volumes, I would plan for this package to hiccup during processing and make it "restartable."

I would have one package with a data flow that selects only the business keys from the OLTP side that are used to match against the target table, and writes those keys into a table that lives on the OLTP server (ToBeTransfered). A second package uses a subset of those keys (N rows) joined back to the original table as the Source, wired directly to the Destination, so no Lookup is required; each fat data row flows over the network only once. Then an Execute SQL Task deletes the batch you just sent to the archive server. This batching method lets you run the second package on multiple servers. The SSIS team describes it better in their paper: We Loaded 1TB in 30 Minutes.
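The batching half of that pattern can be sketched in T-SQL; every name here is illustrative, and a BatchId column is an assumed addition to the queue table so the "ship it, then delete it" steps operate on exactly the same rows:

```sql
-- Hedged sketch: tag a batch of queued keys, ship them, then delete the batch.
UPDATE TOP (50000) dbo.ToBeTransfered
SET    BatchId = 1
WHERE  BatchId IS NULL;

-- Data flow source for the second package: only the tagged batch,
-- joined back to the fat source rows.
SELECT s.*
FROM   dbo.SourceTable   AS s
JOIN   dbo.ToBeTransfered AS q
       ON q.BusinessKey = s.BusinessKey
WHERE  q.BatchId = 1;

-- Execute SQL Task, once the batch has landed on the archive server:
DELETE FROM dbo.ToBeTransfered WHERE BatchId = 1;
```

If the package dies mid-batch, the untagged keys are still queued and the tagged batch can be re-sent, which is what makes the process restartable.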
Ensure the Lookup uses a query of the form SELECT key1, key2 FROM MyTable. Better yet, can you add a filter to the lookup, e.g. WHERE ProcessingYear = 2013? There's no need to waste cache on 2012 if the OLTP only contains 2013 data.
You might need to modify the PacketSize on your Connection Manager and have a network person set up jumbo frames.
Look at your queries. Are you getting good plans? Are your tables over-indexed? Remember, each index increases the number of writes performed. If you can drop them and recreate them after processing completes, you'll think your SAN admins bought you some FusionIO drives. I know I did when I dropped 14 NC indexes from a billion-row table that only had 10 columns in total.
If you're still having performance issues, establish a theoretical baseline (under ideal conditions that will never occur in the real world, I can push 1GB from A to B in N units of time) and work your way from there to what your actual is. You must have a limiting factor (IO, CPU, Memory or Network). Find the culprit and throw more money at it or restructure the solution until it's no longer the lagging metric.
Step 1. Incremental bulk import of the appropriate production data to the new server.
Ref: Importing Data from a Single Client (or Stream) into a Non-Empty Table
http://msdn.microsoft.com/en-us/library/ms177445(v=sql.105).aspx
Step 2. Use a MERGE statement to identify new/existing records and operate on them.
I realize this will take a significant amount of disk space on the new server, but the process will run faster.
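Step 2 could be sketched like this; the table and column names are placeholders, and the change-detection predicate would need to list the real comparable columns:

```sql
-- Hedged sketch: merge the bulk-imported staging rows into the history table.
MERGE dbo.HistoryTable AS t
USING dbo.StagingTable AS s
      ON t.BusinessKey = s.BusinessKey
WHEN MATCHED AND (t.Col1 <> s.Col1 OR t.Col2 <> s.Col2) THEN
    UPDATE SET t.Col1 = s.Col1,
               t.Col2 = s.Col2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Col1, Col2)
    VALUES (s.BusinessKey, s.Col1, s.Col2);
```

The extra AND on the MATCHED clause skips no-op updates, which matters at these row counts.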
I have a table with 4 million+ records. A staging table gets updated with data via an ETL process throughout the day. Once the staging table is updated, I need to sync that data with the production table. I'm currently using an INSERT ... ON DUPLICATE KEY UPDATE query to sync them, but with the size of this table it takes ~750 seconds to run. Is there a more efficient way to update/insert the new data? I have read a bit about partitioning tables, but I'm not sure whether that's what I need. Can anyone give me some suggestions on how to accomplish this more efficiently?
I would use the Maatkit tools (http://www.maatkit.org/), specifically mk-table-sync (http://www.maatkit.org/doc/mk-table-sync.html). It is pretty efficient at this sort of thing.
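A rough sketch of an mk-table-sync invocation for this case (Maatkit later became Percona Toolkit, where the tool is pt-table-sync); the host, database, and table names are placeholders, and the exact DSN syntax should be checked against the tool's documentation:

```
# Dry run: print the statements that would sync production to match staging.
mk-table-sync --print h=localhost,D=mydb,t=staging h=localhost,D=mydb,t=production

# After reviewing the output, re-run with --execute to apply the changes.
```

Because it computes and syncs only the differing chunks, it avoids rewriting the whole 4-million-row table the way a blanket INSERT ... ON DUPLICATE KEY UPDATE does.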