Extract data faster with an OLE DB Source - SSIS

Hi everyone, I'm trying to extract a lot of records from a lot of joined tables and views using SSIS (OLE DB Source), but it takes a huge amount of time! The problem is the query itself, because when I ran it in SQL Server it took more than an hour! Here's my SSIS package design.
I thought of parallel extraction using two OLE DB Sources and a Merge Join, but that approach isn't recommended, and besides, it takes more time! Is there any way you can help me, please?

Writing the T-SQL query with all the joins in the OLE DB Source will always be faster than using separate sources and then a Merge Join, IMHO. The reason is that SSIS has a memory-oriented architecture: it has to bring all the data from N different tables into its buffers and then filter it using the Merge Join. Moreover, Merge Join is an asynchronous (semi-blocking) component, so it cannot use the same input buffer for its output. A new buffer is created, and you may run out of memory if a large number of rows is extracted from the tables.
Having said that, there are a few ways you can enhance extraction performance with an OLE DB Source:
1. Tune your SQL query. Avoid using SELECT *.
2. Check your network bandwidth. You simply cannot have higher throughput than your bandwidth supports.
3. All source adapters are asynchronous. The speed of an SSIS source is not about how fast your query runs; it's about how fast the data is retrieved.
As others have suggested above, you should show us the query and the time it takes to retrieve the data; otherwise these are just a few general optimization techniques that can make the extraction faster.
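As a sketch of point 1 (all table and column names here are invented for illustration), a tuned source query selects only the columns the data flow needs and pushes the joins and filtering down to SQL Server:

```sql
-- Hypothetical example: joins and filters run on the server,
-- and only the required columns travel into the SSIS buffers.
SELECT  o.OrderID,
        o.OrderDate,
        c.CustomerName,
        d.Quantity,
        d.UnitPrice
FROM    dbo.Orders       AS o
JOIN    dbo.OrderDetails AS d ON d.OrderID    = o.OrderID
JOIN    dbo.Customers    AS c ON c.CustomerID = o.CustomerID
WHERE   o.OrderDate >= '2013-01-01';   -- filter early, at the source
```

Narrower rows mean more rows per buffer, which is usually the cheapest throughput win available.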

Thank you for posting a screen shot of your data flow. I doubt whether the slowness you encounter is truly the fault of the OLE DB Source component.
Instead, you have three asynchronous components that result in two full blocks of your data flow and one that's partially blocking (AGG, SRT, MRJ). That first aggregate will have to wait for all 500k rows to arrive before it can finish aggregating and then pass the results along to the sort.
These transformations also result in fragmented memory. Normally, a memory buffer is filled with data and visits each component in a data flow. Any changes happen directly to that address space, and the engine can parallelize operations if it can determine step 2 is modifying field X and step 3 is modifying Y. The async components are going to cause data to be copied from one space to another. This is a double slowdown. The first is the physical act of copying data from address space 0x01 to 0xFA or something. The second is that it reduces the amount of memory available to the dtexec process. No longer can SSIS play with all N gigs of memory. Instead, you'll have quartered your memory, and after each async is done, that memory partition is just left there until the data flow completes.
If you want this to run better, you'll need to fix your query. That may mean materializing your aggregated data into a staging table, or doing it all in one big honkin' query.
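A hedged sketch of the staging-table approach (table and column names are made up): materialize the aggregate on the server so the data flow receives pre-aggregated, pre-sorted rows instead of paying for AGG/SRT/MRJ in SSIS buffers.

```sql
-- Hypothetical: stage the aggregate server-side first.
SELECT  CustomerID,
        SUM(Amount) AS TotalAmount
INTO    dbo.stg_CustomerTotals          -- staging table
FROM    dbo.Sales
GROUP BY CustomerID;

-- The OLE DB Source then reads one joined, ordered result set.
SELECT  s.CustomerID, s.TotalAmount, c.CustomerName
FROM    dbo.stg_CustomerTotals AS s
JOIN    dbo.Customers          AS c ON c.CustomerID = s.CustomerID
ORDER BY s.CustomerID;
```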
Open a new question and provide insight into the data structures, indexes, data volumes, the query itself and preferably the query plan - estimated or actual. If you need help identifying these things, there are plenty of helpful folks here that can help you through the process.

Related

Redshift design or configuration issue? - My Redshift datawarehouse seems much slower than my mysql database

I have a Redshift datawarehouse that is pulling data in from multiple sources.
One is from MySQL and the others are some cloud-based databases that get pulled in.
When querying in Redshift, the query response is significantly slower than against the same MySQL table(s).
Here is an example:
SELECT *
FROM leads
WHERE id = 10162064
In MySQL this takes 0.4 seconds. In Redshift it takes 4.4 seconds.
The table has 11 million rows. "id" is indexed in MySQL; in Redshift it is not, since Redshift is a columnar system.
I know that Redshift is a columnar data warehouse (which is relatively new to me) and Mysql is a relational database that is able to utilize indexes. I'm not sure if Redshift is the right tool for us for reporting, or if we need something else. We have about 200 tables in it from 5 different systems and it is currently at 90 GB.
We have a reporting tool sitting on top that does native queries to pull data. They are pretty slow but are also pulling a ton of data from multiple tables. I would expect some slowness with these, but with a simple statement like above, I would expect it to be quicker.
I've tried some different DIST and SORT key configurations but see no real improvement.
I've run vacuum and analyze with no improvement.
We have 4 nodes, dc2.large. Currently only using 14% storage. CPU utilization is frequently near 100%. Database connections averages about 10 at any given time.
The datawarehouse just has exact copies of the tables from our integration with the other sources. We are trying to do near real-time reporting with this.
Just looking for advice on how to improve performance of our redshift via configuration changes, some sort of view or dim table architecture, or any other tips to help me get the most out of redshift.
I've worked with clients on this type of issue many times and I'm happy to help but this may take some back and forth to narrow in on what is happening.
First I'm assuming that "leads" is a normal table, not a view and not an external table. Please correct if this assumption isn't right.
Next I'm assuming that this table isn't very wide and that "select *" isn't contributing greatly to the speed concern. Yes?
Next question: why this size of cluster for a table of only 11M rows? My guess is that there are other, much larger data sets in the database and that this table isn't what's driving the cluster size.
The first step of narrowing this down is to go onto the AWS console for Redshift and find the query in question. Look at the actual execution statistics and see where the query is spending its time. I'd guess it will be in loading (scanning) the table but you never know.
You also should look at STL_WLM_QUERY for the query in question and see how much wait time there was with the running of this query. Queueing can take time and if you have interactive queries that need faster response times then some WLM configuration may be needed.
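As an illustrative check (substitute your own query id from the console; my understanding is that STL_WLM_QUERY reports these times in microseconds, but verify against your cluster's docs):

```sql
-- Hypothetical: compare queue time vs. execution time for one query.
SELECT query,
       service_class,
       total_queue_time / 1000000.0 AS queue_seconds,
       total_exec_time  / 1000000.0 AS exec_seconds
FROM   stl_wlm_query
WHERE  query = 123456;   -- the query id you found in the console
```

If queue_seconds dominates, the fix is WLM configuration, not the table layout.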
It could also be compile time but given the simplicity of the query this seems unlikely.
My suspicion is that the table is spread too thin around the cluster and there are lots of mostly empty blocks being read, but this is just based on assumptions. Is "id" the distkey or sortkey for this table? Other factors likely in play are cluster load - is the cluster busy when this query runs? WLM is one place where things can interfere, but disk IO bandwidth is a shared resource, and if some other queries are abusing the disks, this will make every query's access to disk slow. (The same is true of network bandwidth and leader node workload, but these don't seem to be central to your issue at the moment.)
As I mentioned resolving this will likely take some back and forth so leave comments if you have additional information.
(I am speaking from a knowledge of MySQL, not Redshift.)
SELECT * FROM leads WHERE id = 10162064
If id is indexed, especially if it is a Unique (or Primary) key, 0.4 sec sounds like a long network delay. I would expect 0.004s as a worst case (with SSDs and `PRIMARY KEY(id)`).
(If leads is a VIEW, then let's see the tables. 0.4s may be reasonable!)
That query works well for a RDBMS, but not for a columnar database. Face it.
I can understand using a columnar database to handle random queries on various columns. See also MariaDB's implementation of "Columnstore" -- that would give you both RDBMS and Columnar in a single package. Still, they are separate enough that you can't really intermix the two technologies.
If you are getting 100% CPU in MySQL, show us the query, its EXPLAIN, and SHOW CREATE TABLE. Often, a better index and/or query formulation can solve that.
For "real time reporting" in a Data Warehouse, building and maintaining Summary Tables is often the answer.
Tell us more about the "exact copy" of the DW data. In some situations, the Summary tables can supplant one copy of the Fact table data.
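To make the Summary Table suggestion concrete, here is a hedged sketch (the table, columns, and grouping are invented; adapt them to whatever your reporting tool actually asks for). The reporting tool hits the small summary instead of scanning the raw leads table:

```sql
-- Hypothetical daily summary, rebuilt on a schedule after each load.
CREATE TABLE leads_daily_summary AS
SELECT   source_system,
         DATE_TRUNC('day', created_at) AS lead_date,
         COUNT(*)                      AS lead_count
FROM     leads
GROUP BY source_system,
         DATE_TRUNC('day', created_at);
```

A few hundred summary rows scan in milliseconds, which is the kind of response a columnar warehouse can deliver for reporting.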

Long time open database connection and SQL join query memory

I'm currently working on a java application which performs following in a background thread.
opens database connection
Select some rows (100000+ rows)
perform a long-running task for each row by calling ResultSet.next(), with a buffer size defined by resultSet.setFetchSize()
finally, after everything's done, closes the connection
If the query does some sorting or joining, it will create a temp table and use some additional memory. My question is: if my database connection is kept open for a long time (let's say a few hours) and fetches batch by batch slowly, will it cause performance troubles in the database due to memory usage? (The database is concurrently used by other threads as well.) Or are databases designed to handle these things effectively?
(In the context of both MySQL and Oracle)
From an Oracle perspective, opening a cursor and fetching from it periodically doesn't have that much of an impact if it's left open... unless the underlying data that the cursor is querying against changes since the query was first started.
If so, the Oracle database now has to do additional work to find the data as it was at the start of the query (since read-consistency!), so now it needs to query the data blocks (either on disk or from the buffer cache) and, in the event the data has changed, the undo tablespace.
If the undo tablespace is not sized appropriately and enough data has changed, you may find that your cursor fetches fail with an "ORA-01555: snapshot too old" exception.
In terms of memory usage, a cursor doesn't open a result set and store it somewhere for you; it's simply a set of instructions to the database on how to get the next row that gets executed when you do a fetch. What gets stored in memory is that set of instructions, which is relatively small when compared to the amount of data it can return!
This mechanism doesn't seem like a good idea. Although both MySQL (InnoDB engine) and Oracle provide consistent reads for SELECT, holding such a long-running SELECT open may lead to performance degradation, because the database has to build consistent-read (CR) blocks and do other extra work, and in Oracle it can even raise ORA-01555.
I think you should query/export all the data first, then process the actual business logic row by row.
Admittedly, querying all the data first will not reduce memory usage, but it will reduce how long the memory and temp sort segments/files stay in use.
Alternatively, you could consider splitting the whole job into small pieces; that would be better still.

SSIS to insert non-matching data on non-linked server

This is regarding SQL Server 2008 R2 and SSIS.
I need to update dozens of history tables on one server with new data from production tables on another server.
The two servers are not, and will not be, linked.
Some of the history tables have 100's of millions of rows and some of the production tables have dozens of millions of rows.
I currently have a process in place for each table that uses the following data flow components:
OLEDB Source task to pull the appropriate production data.
Lookup task to check if the production data's key already exists in the history table, using "Redirect rows to error output" to
transfer the missing data to the OLE DB Destination history table.
The process is too slow for the large tables. There has to be a better way. Can someone help?
I know if the servers were linked a single set based query could accomplish the task easily and efficiently, but the servers are not linked.
Segment your problem into smaller problems. That's the only way you're going to solve this.
Let's examine the problems.
You're inserting and/or updating existing data. At a database level, rows are packed into pages. Rarely is it an exact fit, and there's usually some amount of free space left in a page. When you update a row, pretend the Name field went from "bob" to "Robert Michael Stuckenschneider III". That row needs more room to live, and while there's some room left on the page, there's not enough. Other rows might get shuffled down to the next page just to give this one some elbow room. That's going to cause lots of disk activity. Yes, it's inevitable given that you are adding more data, but it's important to understand how your data is going to grow and ensure your database itself is ready for that growth.
Maybe you have some non-clustered indexes on a target table. Disabling/dropping them should improve insert/update performance. If you still have your database and log set to grow at 10% or 1MB or whatever the default values are, the storage engine is going to spend all of its time trying to grow files and won't have time to actually write data.
Takeaway: ensure your system is poised to receive lots of data. Work with your DBA, LAN and SAN team(s).
You have tens of millions of rows in your OLTP system and hundreds of millions in your archive system. Starting with the OLTP data, you need to identify what does not exist in your historical system. Given your data volumes, I would plan for this package to hiccup during processing, so it needs to be "restartable."
I would have a package with a data flow that selects only the business keys from the OLTP that are used to make a match against the target table. Write those keys into a table that lives on the OLTP server (ToBeTransfered). Have a second package that uses a subset of those keys (N rows) joined back to the original table as the Source. It's wired directly to the Destination, so no lookup is required. That fat data row flows over the network only one time. Then have an Execute SQL Task go in and delete the batch you just sent to the Archive server. This batching method also allows you to run the second package on multiple servers. The SSIS team describes it better in their paper: We loaded 1TB in 30 minutes
Ensure the Lookup is a query of the form SELECT key1, key2 FROM MyTable. Better yet, can you provide a filter to the lookup? e.g. WHERE ProcessingYear = 2013, as there's no need to waste cache on 2012 if the OLTP only contains 2013 data.
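As a sketch (table and column names are placeholders), the Lookup's source query would look like this rather than a SELECT * against the whole history table:

```sql
-- Hypothetical lookup query: only the key columns, filtered so the
-- lookup cache never holds rows that cannot possibly match.
SELECT key1, key2
FROM   dbo.HistoryTable
WHERE  ProcessingYear = 2013;
```

Two narrow key columns cache in a fraction of the memory of the full row, and the filter shrinks the pre-execute phase accordingly.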
You might need to modify your PacketSize on your Connection Manager and have a network person set up Jumbo frames.
Look at your queries. Are you getting good plans? Are your tables over-indexed? Remember, each index is going to result in an increase in the number of writes performed. If you can dump them and recreate after the processing is completed, you'll think your SAN admins bought you some FusionIO drives. I know I did when I dropped 14 NC indexes from a billion row table that only had 10 total columns.
If you're still having performance issues, establish a theoretical baseline (under ideal conditions that will never occur in the real world, I can push 1GB from A to B in N units of time) and work your way from there to what your actual is. You must have a limiting factor (IO, CPU, Memory or Network). Find the culprit and throw more money at it or restructure the solution until it's no longer the lagging metric.
Step 1. Incremental bulk import of the appropriate production data to the new server.
Ref: Importing Data from a Single Client (or Stream) into a Non-Empty Table
http://msdn.microsoft.com/en-us/library/ms177445(v=sql.105).aspx
Step 2. Use Merge Statement to identify new/existing records and operate on them.
I realize that it will take a significant amount of disk space on the new server, but the process would run faster.
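A hedged sketch of Step 2 (table and column names are invented): after the bulk import lands the production extract in a staging table, a single MERGE inserts only the rows whose keys are missing from history.

```sql
-- Hypothetical: insert only the new keys into the history table.
MERGE dbo.HistoryTable AS h
USING dbo.stg_ProductionExtract AS s
      ON h.BusinessKey = s.BusinessKey
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Col1, Col2)
    VALUES (s.BusinessKey, s.Col1, s.Col2);
```

This keeps the anti-join set-based and server-side, which is exactly what the row-by-row SSIS Lookup was struggling to do.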

Executing SSIS package creates huge no. of temp files which makes me run out of disk space

I have an SSIS package which I run via a SQL job to bulk copy data from one database to another. The destination is our integration server, where we have enough space for the database. But when I run this job (i.e. the package), it creates a huge number of temp files in the local settings/temp folder: for a 1 GB mdf file it creates some 20 GB of temp files. I created this package manually and did not use the Import/Export Wizard. Can anyone help me avoid these huge temp files during execution? If any further details are needed, please mention them.
Note: many people said this happens if you create a package using the Import/Export Wizard with "optimize for many tables" set to true. But in this package I query only one table, and I created it manually without the wizard.
Why is the package creating temp files?
SSIS is an in-memory ETL solution, except when it can't keep everything in memory and begins swapping to disk.
Why would restructuring the package as @Jeff Hornby suggested help?
Fully and partially blocking transformations force memory copies in your data flow. Assume you have 10 buckets carrying 1MB of data each. When you use a blocking transformation, as those buckets arrive at the transformation the data has to be copied from one memory location to another one. You've now doubled your package's total memory consumption, as you have 10MB of data in use before the union all transformation and then another 10MB after it.
Only use columns that you need. If a column is not in your destination, don't add it to the data flow. Use the database to perform sorts and merges. Cast your data to the appropriate types before it ever hits the data flow.
What else could be causing temp file usage
Lookup transformations. I've seen people crush their ETL server when they use SELECT * FROM dbo.BillionRowTable when all they needed was one or two columns for the current time period. The default behaviour of a lookup operation is to execute that source query and cache the results in memory. For large tables, wide and/or deep, this can make it look like your data flow isn't even running as SSIS is busy streaming and caching all of that data as part of the pre-execute phase.
Binary/LOB data. Have an (n)varchar(max)/varbinary(max) or classic BLOB data type in your source table? Sorry, that's not going to be in memory. Instead, the data flow is going to carry a pointer along and write a file out for each one of those objects.
Too much parallel processing. SSIS is awesome in that you get free parallelization of your processing. Except you can have too much of a good thing. If you have 20 data flows all floating in space with no precedence between them, the Integration Services engine may try to run all of them at once. Add a precedence constraint between them, even if it's just on completion (on success/on fail), to force some serialization of operations. Inside a data flow, you can introduce the same challenge by having unrelated operations going on. My rule of thumb is that starting at any source or destination, I should be able to reach all the other sources/destinations.
What else can I do?
Examine what else is using memory on the box. Have you set a sane (non-default) maximum memory value for SQL Server? SSIS likes RAM like a fat kid loves cake, so you need to balance the memory needs of SSIS against the database itself - they have completely separate memory spaces.
Each data flow has the ability to set the BufferTempStoragePath and BlobTempStoragePath properties. Take advantage of this and put them on a drive with sufficient storage.
Finally, add more RAM. If you can't make the package better by doing the above, throw more hardware at it and be done.
If you're getting that many temp files, then you probably have a lot of blocking transforms in your data flow. Try to eliminate the following types of transformations: Aggregate, Fuzzy Grouping, Fuzzy Lookup, Row Sampling, Sort, Term Extraction. Also, partially blocking transformations can create the same problems, but not on the same scale: Data Mining Query, Merge, Merge Join, Pivot, Term Lookup, Union All, Unpivot. You might want to try to minimize these transformations.
Probably the problem is a Sort transformation somewhere in your data flow (this is the most common). You might be able to eliminate it by using an ORDER BY clause in your SQL statement. Just remember to set the sorted property on the data source's output.
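A minimal sketch of that advice (table and column names are invented): sort in the database, then tell SSIS the source output is already ordered by marking IsSorted = True and setting SortKeyPosition on the key column in the source's Advanced Editor.

```sql
-- Hypothetical source query: the database does the sorting, so the
-- blocking Sort transformation (and its temp files) can be removed.
SELECT CustomerID, OrderDate, Amount
FROM   dbo.Sales
ORDER BY CustomerID;
```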

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a mysql server to store some data but realized(after reading a bit this weekend) I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although it's very small data: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60 to 100k rows/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal(in addition to the database server), can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time but I'm also thinking maybe I can load the data onto each of them and then somehow try to merge all the separated data into one server?
I was going to use MySQL with InnoDB (I can use any other settings if it helps), but it's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before, but was looking for a MySQL solution first; in case I run into problems, it seems more widely used and easier to get help with.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple mySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your mySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using myISAM rather than InnoDB for this table; myISAM's lack of transaction semantics makes it faster to load. myISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
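A hedged sketch of the per-day rotation (the table name is a placeholder): load into a fresh table, then swap it in atomically with a multi-table RENAME so readers never see a partially loaded table.

```sql
-- Hypothetical daily rotation: build the new day's table off to
-- the side, then swap names in a single atomic statement.
CREATE TABLE daily_data_new LIKE daily_data;

-- ... bulk-load daily_data_new here ...

RENAME TABLE daily_data     TO daily_data_old,
             daily_data_new TO daily_data;
DROP TABLE daily_data_old;    -- yesterday's data is now gone
```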
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the mySQL server to read a file from the mySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the mySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
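Putting the last two suggestions together, a sketch might look like this (file path, table, and column names are invented; note that DISABLE KEYS suspends only the non-unique indexes, and is most effective on a MyISAM table as suggested above):

```sql
-- Hypothetical bulk load: suspend secondary index maintenance,
-- load straight from a server-side file, then rebuild indexes once.
ALTER TABLE daily_data DISABLE KEYS;

LOAD DATA INFILE '/var/lib/mysql-files/batch_2013_07_01.tsv'
INTO TABLE daily_data
FIELDS TERMINATED BY '\t'
LINES  TERMINATED BY '\n'
(id, payload);

ALTER TABLE daily_data ENABLE KEYS;   -- one bulk index rebuild
```

Rebuilding the indexes once at the end is far cheaper than updating them row by row across billions of inserts.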