Performance issues with Lookup - SSIS

I am facing a terrible issue with my SSIS packages. In my packages I have one Lookup, and I have a condition like this:
The OLE DB Source has around 400,000 records and the lookup table has around 1,200,000 records. Both tables can grow, but the one coming from the OLE DB Source will have at most around 900,000. Both tables have around 40-50 columns to look up.
There are 51528912896 bytes of physical memory with 32689860608 bytes free. There are 4294836224 bytes of virtual memory with 249348096 bytes free. The paging file has 120246493184 bytes with 109932904448 bytes free.
Is there any effective solution to this?

At that scale I would consider using a Merge Join transformation instead of a Lookup. Order by your keys in your OLE DB Source SQL and define the sort manually on the source output (ref http://www.ssistalk.com/2009/09/17/ssis-avoiding-the-sort-components/ ). While slower than a fully cached Lookup, this design tends to scale better in terms of memory use.
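As a hedged sketch (the table and column names below are placeholders), each side of the Merge Join would sort on the join key in T-SQL, and you would then mark the output as sorted in the source's Advanced Editor:
-- Sort each source on the join key in the database engine, then set
-- IsSorted = True on the OLE DB Source output and SortKeyPosition = 1
-- on BusinessKey (Advanced Editor) so the Merge Join accepts the input.
SELECT BusinessKey, Col1, Col2
FROM dbo.SourceTable
ORDER BY BusinessKey;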

Related

Does 'Rows per batch' in SSIS OLE DB Destination help reduce locking?

There is an option called 'Rows per batch' in the OLE DB Destination which, when specified, pulls a certain number of rows per batch; otherwise it pulls all rows from the source in one batch.
Question: If my source and/or target server are both highly loaded OLTP databases, will setting a low number for this parameter (e.g. 10k or 50k) help reduce the chance of lock escalation, so that the loading process has minimal impact on either database?
"Rows per batch" is actually more for tuning your data flow. By calculating the maximum width of a row, in bytes, and then dividing the default buffer size (the default is 10MB), you will get the number of rows you can insert in one "batch" without spilling the data out to tempdb or your buffer disk (depending if you set a specific location for you temp buffer). While keeping your data flow completely in memory and not needing to spill to disk, you'll keep your data transfer moving as quickly as possible.
The "Table Lock" option in the OLE DB Destination is what tells the server to lock the table or not.
In general the answer is: yes.
It also depends on the processing speed of the rows and the overhead per batch.
If your transaction with all the rows in a batch takes too long, consider splitting it up. But splitting it into batches that are too small can also cause performance problems.
The best way is to test and find the sweet spot.

MySQL Memory Engine vs InnoDB on RAMdisk

I'm writing a bit of software that needs to flatten data from a hierarchical type of format into tabular format. Instead of doing it all in a programming language every time and serving it up, I want to cache the results for a few seconds, and use SQL to sort and filter. When in use, we're talking 400,000 writes and 1 or 2 reads over the course of those few seconds.
Each table will contain 3 to 15 columns. Each row will contain from 100 bytes to 2,000 bytes of data, although it's possible that in some cases, some rows may get up to 15,000 bytes. I can clip data if necessary to keep things sane.
The main options I'm considering are:
MySQL's Memory engine
A good option, almost specifically written for my use case! But: "MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length. MEMORY tables cannot contain BLOB or TEXT columns." Unfortunately, I do have text fields with a length of up to maybe 10,000 characters, and even that is a number that is not specifically limited. I could adjust the VARCHAR length based on the max length of the text columns as I loop through doing my flattening, but that's not totally elegant. Also, for my occasional 15,000-character row, does that mean I need to allocate 15,000 characters for every row in the database? If there were 100,000 rows, that's roughly 1.4 GB not including overhead!
InnoDB on RAMDisk
This is meant to run in the cloud, and I could easily spin up a server with 16 GB of RAM, configure MySQL to write to tmpfs, and use full-featured MySQL. My concern with this is space. While I'm sure engineers have written the MEMORY engine to prevent it from consuming all temp storage and crashing the server, I doubt this solution would know when to stop. How much actual space will my 2,000 bytes of data consume when in database format? How can I monitor it?
Bonus Questions
Indexes
I will in fact know in advance which columns need to be filtered and sorted by. I could set up an index before I do inserts, but what kind of performance gain could I honestly expect on top of a RAM disk? How much extra overhead do indexes add?
Inserts
I'm assuming inserting multiple rows with one query is faster. But that one query, or series of large queries, is itself stored in memory, and we're writing to memory, so if I did that I'd momentarily need double the memory. So then we're talking about doing one or two or a hundred at a time, and having to wait for that to complete before processing more. InnoDB doesn't lock the table, but I worry about sending two queries too close to each other and confusing MySQL. Is this a valid concern? With the MEMORY engine I'd definitely have to wait for completion, due to table locks.
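(For reference, the kind of multi-row insert I mean, with placeholder table and column names:)
INSERT INTO flattened_cache (id, sort_col, payload) VALUES
    (1, 'alpha', '...'),
    (2, 'beta',  '...'),
    (3, 'gamma', '...');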
Temporary
Are there any benefits to temporary tables other than the fact that they're deleted when the db connection closes?
I suggest you use MyISAM. Create your table with appropriate indexes for your query. Then disable keys, load the table, and enable keys.
I suggest you develop a discipline like this for your system. I've used a similar discipline very effectively.
Keep two copies of the table. Call one table_active and the second one table_loading.
When it's time to load a new copy of your data, use commands like this.
ALTER TABLE table_loading DISABLE KEYS;
/* do your insertions here, to table_loading */
/* consider using LOAD DATA INFILE if it makes sense. */
ALTER TABLE table_loading ENABLE KEYS; /* this will take a while */
/* at this point, suspend your software that's reading table_active */
RENAME TABLE table_active TO table_old;
RENAME TABLE table_loading TO table_active;
/* now you can resume running your software */
TRUNCATE TABLE table_old;
RENAME TABLE table_old TO table_loading;
Alternatively, you can DROP TABLE table_old; and create a new table for table_loading instead of the last rename.
This two-table (double-buffered) strategy should work pretty well. It will create some latency because your software that's reading the table will work on an old copy. But you'll avoid reading from an incompletely loaded table.
I suggest MyISAM because you won't run out of RAM and blow up, and you won't have the fixed-row-length overhead or the transaction overhead. But you might also consider MariaDB and the Aria storage engine, which does a good job of exploiting RAM buffers.
If you do use the MEMORY storage engine, be sure to tweak the max_heap_table_size system variable. If your read queries will use index range scans (sequential index access), be sure to specify BTREE-style indexes. See here: http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html
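A minimal sketch, assuming a hypothetical flattened cache table; adjust the names and sizes to your data:
SET max_heap_table_size = 1024 * 1024 * 1024;  -- allow up to 1 GB per MEMORY table

CREATE TABLE flattened_cache (
    id INT NOT NULL,
    sort_col VARCHAR(255) NOT NULL,
    payload VARCHAR(2000),
    PRIMARY KEY (id),
    INDEX idx_sort (sort_col) USING BTREE  -- BTREE supports range scans; MEMORY defaults to HASH
) ENGINE = MEMORY;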

SSIS to insert non-matching data on non-linked server

This is regarding SQL Server 2008 R2 and SSIS.
I need to update dozens of history tables on one server with new data from production tables on another server.
The two servers are not, and will not be, linked.
Some of the history tables have hundreds of millions of rows and some of the production tables have dozens of millions of rows.
I currently have a process in place for each table that uses the following data flow components:
OLEDB Source task to pull the appropriate production data.
Lookup task to check if the production data's key already exists in the history table, using the "Redirect to error output" option for non-matching rows.
OLE DB Destination to receive the missing (non-matching) data into the history table.
The process is too slow for the large tables. There has to be a better way. Can someone help?
I know if the servers were linked a single set based query could accomplish the task easily and efficiently, but the servers are not linked.
Segment your problem into smaller problems. That's the only way you're going to solve this.
Let's examine the problems.
You're inserting and/or updating existing data. At a database level, rows are packed into pages. Rarely is it an exact fit, and there's usually some amount of free space left in a page. When you update a row, pretend the Name field went from "bob" to "Robert Michael Stuckenschneider III". That row needs more room to live, and while there's some room left on the page, there's not enough. Other rows might get shuffled down to the next page just to give this one some elbow room. That's going to cause lots of disk activity. Yes, it's inevitable given that you are adding more data, but it's important to understand how your data is going to grow and ensure your database itself is ready for that growth.
Maybe you have some non-clustered indexes on a target table. Disabling or dropping them should improve insert/update performance. If you still have your database and log set to grow at 10% or 1MB or whatever the default values are, the storage engine is going to spend all of its time trying to grow files and won't have time to actually write data. Takeaway: ensure your system is poised to receive lots of data. Work with your DBA, LAN and SAN team(s).
You have tens of millions of rows in your OLTP system and hundreds of millions in your archive system. Starting with the OLTP data, you need to identify what does not exist in your historical system. Given your data volumes, I would plan for this package to hiccup in processing, so it needs to be "restartable." I would have a package with a data flow that selects only the business keys from the OLTP that are used to make a match against the target table. Write those keys into a table that lives on the OLTP server (ToBeTransfered).
Have a second package that uses a subset of those keys (N rows) joined back to the original table as the source. It's wired directly to the destination, so no Lookup is required. That fat data row flows over the network only one time. Then have an Execute SQL Task go in and delete the batch you just sent to the archive server. This batching method can allow you to run the second package on multiple servers. The SSIS team describes it better in their paper: We Loaded 1TB in 30 Minutes.
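A hedged sketch of that batching pattern; ToBeTransfered, dbo.ProductionTable, BusinessKey and the column names are placeholders for illustration:
-- Package 2, OLE DB Source: one batch of full-width rows joined back to the staged keys.
SELECT TOP (100000) p.BusinessKey, p.Col1, p.Col2
FROM dbo.ProductionTable AS p
JOIN dbo.ToBeTransfered AS t ON t.BusinessKey = p.BusinessKey
ORDER BY p.BusinessKey;

-- Package 2, Execute SQL Task after the destination commits: remove that batch of keys.
DELETE t
FROM dbo.ToBeTransfered AS t
WHERE t.BusinessKey IN (SELECT TOP (100000) BusinessKey
                        FROM dbo.ToBeTransfered
                        ORDER BY BusinessKey);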
Ensure the Lookup uses a query of the form SELECT key1, key2 FROM MyTable. Better yet, can you provide a filter to the Lookup, e.g. WHERE ProcessingYear = 2013? There's no need to waste cache on 2012 if the OLTP only contains 2013 data.
You might need to modify your PacketSize on your Connection Manager and have a network person set up Jumbo frames.
Look at your queries. Are you getting good plans? Are your tables over-indexed? Remember, each index is going to result in an increase in the number of writes performed. If you can dump them and recreate after the processing is completed, you'll think your SAN admins bought you some FusionIO drives. I know I did when I dropped 14 NC indexes from a billion row table that only had 10 total columns.
If you're still having performance issues, establish a theoretical baseline (under ideal conditions that will never occur in the real world, I can push 1GB from A to B in N units of time) and work your way from there to what your actual is. You must have a limiting factor (IO, CPU, Memory or Network). Find the culprit and throw more money at it or restructure the solution until it's no longer the lagging metric.
Step 1. Incremental bulk import of the appropriate production data to the new server.
Ref: Importing Data from a Single Client (or Stream) into a Non-Empty Table
http://msdn.microsoft.com/en-us/library/ms177445(v=sql.105).aspx
Step 2. Use Merge Statement to identify new/existing records and operate on them.
I realize that it will take a significant amount of disk space on the new server, but the process would run faster.
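To make Step 2 concrete, a hedged MERGE sketch (table and column names are placeholders):
MERGE dbo.HistoryTable AS tgt
USING dbo.StagingTable AS src
    ON tgt.BusinessKey = src.BusinessKey
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Col1, Col2)
    VALUES (src.BusinessKey, src.Col1, src.Col2)
WHEN MATCHED THEN
    UPDATE SET tgt.Col1 = src.Col1,
               tgt.Col2 = src.Col2;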

Extract data with an OLE DB faster

Hi everyone. I'm trying to extract a lot of records from a lot of joined tables and views using SSIS (OLE DB Source), but it takes a huge amount of time! The problem is due to the query, because when I ran it in SQL Server it took more than an hour! Here's my SSIS package design:
I thought of parallel extraction using two OLE DB Sources and a Merge Join, but using that isn't recommended! Besides, it takes more time! Is there any way to help me, please?
Writing the T-SQL query with all the joins in the OLE DB Source will always be faster than using separate sources and then a Merge Join, IMHO. The reason is that SSIS has a memory-oriented architecture: it has to bring all the data from the N different tables into its buffers and then filter it using the Merge Join. Moreover, Merge Join is an asynchronous (semi-blocking) component, so it cannot use the same input buffer for its output. A new buffer is created, and you may run out of memory if a large number of rows is extracted from the tables.
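A minimal illustration of the point, with placeholder table and column names: push the join into the OLE DB Source query and list only the columns the data flow actually needs.
SELECT o.OrderID, o.OrderDate, c.CustomerName
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID;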
Having said that, there are a few ways you can improve extraction performance with the OLE DB Source:
1. Tune your SQL query. Avoid using SELECT *.
2. Check your network bandwidth. You just cannot have faster throughput than your bandwidth supports.
3. All source adapters are asynchronous. The speed of an SSIS source is not about how fast your query runs; it's about how fast the data is retrieved.
As others have suggested above, you should show us the query and also the time it takes to retrieve the data; otherwise these are just a few optimization techniques which can make the extraction faster.
Thank you for posting a screen shot of your data flow. I doubt whether the slowness you encounter is truly the fault of the OLE DB Source component.
Instead, you have three asynchronous components (Aggregate, Sort, Merge Join): two of them fully block your data flow and one is partially blocking. That first Aggregate will have to wait for all 500k rows to arrive before it can finish the aggregation and pass the result along to the Sort.
These transformations also result in fragmented memory. Normally, a memory buffer is filled with data and visits each component in a data flow. Any changes happen directly in that address space, and the engine can parallelize operations if it can determine that step 2 is modifying field X and step 3 is modifying field Y. The async components cause data to be copied from one space to another, which is a double slowdown. The first is the physical act of copying data from address space 0x01 to 0xFA or wherever. The second is that it reduces the amount of memory available to the dtexec process. No longer can SSIS play with all N gigs of memory; instead, you'll have quartered your memory, and after each async component is done, that memory partition is just left there until the data flow completes.
If you want this to run better, you'll need to fix your query. That may mean materializing your aggregated data into a staging table, or doing it all in one big honkin' query.
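For example, a hedged sketch (placeholder names) that pre-aggregates and pre-sorts in the source query so the Aggregate, Sort, and Merge Join components can be dropped from the data flow:
SELECT s.CustomerID, SUM(s.Amount) AS TotalAmount
FROM dbo.Sales AS s
GROUP BY s.CustomerID
ORDER BY s.CustomerID;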
Open a new question and provide insight into the data structures, indexes, data volumes, the query itself and preferably the query plan - estimated or actual. If you need help identifying these things, there are plenty of helpful folks here that can help you through the process.

Access MDB: do Access MDB files have an upper size limit?

Do Access .MDB files have an upper size limit? Will, for example, applications that connect to an .MDB file over 1 GB have problems?
What problems/risks are there with MDB files over 1GB and what can be done?
Yes. The maximum size for an MDB file is 2 GB, and yes, any file over 1 GB is really pushing Access.
See Access Database Specifications for more (the original link is now broken; a copy is available via the Wayback Machine).
You may find data retrieval painfully slow with a large Access database. Indexing can reduce the pain considerably. For example if you have a query which includes "WHERE somefield = 27", data retrieval can be much faster if you create an index on somefield. If you have little experience with indexing, try the Performance Analyzer tool to get you started. In Access 2003 the Performance Analyzer is available from Tools -> Analyze -> Performance. I'm not sure about other Access versions.
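For that example, the index could be created with a statement like this (the names are illustrative):
CREATE INDEX idxSomeField ON MyTable (somefield);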
One caveat about indexes is that they add overhead for Insert, Update, and Delete operations because the database engine must revise indexes in addition to the table where the changes occur. So if you were to index everything, you would likely make performance worse.
Try to limit the amount of data your client application retrieves from the big database. For example with forms, don't use a table as the form's data source. Instead create a query which returns only one or a few rows, and use the query as the form's data source. Give the user a method to select which record she wants to work on and retrieve only that record.
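A minimal example of such a one-row record source, assuming a hypothetical picker form and control:
SELECT CustomerID, CustomerName, Phone
FROM Customers
WHERE CustomerID = [Forms]![frmCustomerPicker]![cboCustomerID];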
You didn't mention whether you have performed a Compact and Repair. If not, try it; it could shrink the size of your database considerably. In addition to reclaiming unused space, compacting also updates the index statistics, which helps the database engine determine how to access data more efficiently.
Tony Toews has more information about Access performance you may find useful, though it's not specific to large databases. See Microsoft Access Performance FAQ
If you anticipate bumping up against the 2 GB limit for MDB files, consider moving your data into SQL Server. The free Express edition also limits the amount of data you can store, but it is more generous than Access: SQL Server 2008 R2 Express will allow you to store 10 GB. Actually, I would probably move to SQL Server well before Access's 2 GB limit.
From the Access specifications:
Maximum database size: 2 GB, total for all objects in the database.
Maximum number of objects in a database: 32,768.