I am trying to improve the performance of an SSIS package.
One thing I started with is filtering the reference table of the Lookups. Until now, I was using a whole table as the reference table for that lookup.
The first improvement was to replace the table with a SQL query that selects just the columns I need from that table.
Next, I want to load into the lookup only the records I know I'll use for sure. If I keep it as it is, I will load 300 000 rows or more (huge rows with binary content of around 500 KB each) and use only around 100 of them.
I would put some filters in the SQL query that sets the reference table of the lookup, BUT, in that filter I need to use ALL the ids of the rows loaded in my OLE DB source.
Is there any way to do this?
I thought of loading one row at a time using an OLE DB Command instead of a Lookup, but besides being time consuming, I might end up loading the same thing 100 times for 100 different rows, when I could just load it once in the lookup and use it 100 times...
Enabling the cache would be another option, but it still doesn't sound very good, because it would slow us down and we are already terribly slow.
Any ideas are greatly appreciated.
One possibility is to first stream the distinct IDs to a permanent/temporary table in one data flow and then use it in your lookup (with a join) in a later data flow (you probably have to defer validation).
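To make the idea concrete, here is a minimal sketch of the two steps, using Python's built-in sqlite3 as a stand-in for the real database. The table and column names (reference, source_rows, staged_ids, payload) are made up for illustration; in SSIS the first statement would be a data flow writing to the staging table and the second would be the lookup's SQL query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: a large reference table and the rows read by the source.
cur.execute("CREATE TABLE reference (id INTEGER PRIMARY KEY, payload TEXT)")
cur.execute("CREATE TABLE source_rows (row_id INTEGER PRIMARY KEY, ref_id INTEGER)")
cur.executemany(
    "INSERT INTO reference VALUES (?, ?)",
    [(i, f"payload-{i}") for i in range(1, 1001)],
)
cur.executemany(
    "INSERT INTO source_rows (ref_id) VALUES (?)",
    [(7,), (7,), (42,), (7,), (900,)],
)

# Step 1 (first data flow): stage only the distinct ids the source actually uses.
cur.execute("CREATE TEMP TABLE staged_ids (id INTEGER PRIMARY KEY)")
cur.execute("INSERT INTO staged_ids SELECT DISTINCT ref_id FROM source_rows")

# Step 2 (lookup query in the later data flow): join to the staged ids so that
# only the needed reference rows end up in the lookup cache.
lookup_rows = cur.execute(
    "SELECT r.id, r.payload FROM reference AS r "
    "JOIN staged_ids AS s ON s.id = r.id ORDER BY r.id"
).fetchall()

print(lookup_rows)
```

With the sample data, the lookup query returns three reference rows instead of a thousand, and each is loaded once no matter how many source rows point to it.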
In many of our ETL packages, we first stream the data into a Raw file, handling all the type conversions and everything else on the way there. Once all these conversions have succeeded, we create the new dimensions and then the facts linking to the dimensions.
I have a table, “tblData_Dir_Statistics_Detail”, with 15 fields that holds about 150,000 records. I can load the records into this table but am having trouble updating other tables from it (I want to use several different queries to update a couple of different tables). There is only one index on the table, and the only unusual thing about it is that there are 3 text fields that I run out to 255 characters because some of the paths\data are that long or even exceed 255. I have tried trimming these to 150 characters, but it has no impact on the problems that I am having with this table.
Additionally, I manually recreated the table because it acts like it is corrupted. Even this had no impact on the problems.
The original problem that I was getting is that my code would stop with a “System resource exceeded.”.
Here is the list of things I am experiencing and can’t seem to figure out:
When I use the table in an update query (watching Task Manager), I always see the Physical Memory usage for Access jump from about 35,000 K to 85,000 K instantly when the code hits this query, and then, within a second or two, I get the “System resource exceeded.” error.
Sometimes, but not all the time, when I compact and repair, tblData_Dir_Statistics_Detail is deleted by the process and is subsequently listed in MSysCompactError table as an error. The “ErrorCode” in the table is -1011 and the “ErrorDiscription” is “System resource exceeded.”
Sometimes, but not all the time, when I compact and repair, if I lose tblData_Dir_Statistics_Detail, I will lose the next one below it in the database window (shows also in the SYS table).
Sometimes, but not all the time, when I compact and repair, if I lose tblData_Dir_Statistics_Detail, I will lose the next TWO positions below it in the database window (shows also in the SYS table).
I have used table structures like this with much larger tables without problems for years. Additionally, I have a parallel table “tblData_Dir_Statistics” which has virtually the same structure and holds the same data but at a summarized level, and have no trouble with that or any other table.
Summary:
My suspicion is that there is some kind of character being imported into one of the fields that is corrupting this entire table.
If this is true, how could I find the corruption?
If it is not this, what else could it be?
Many Thanks!
A few considerations:
Access files have a size limit of 2 GB. If your file becomes bigger than 2 GB at any time (even by 1 byte), the whole file is corrupted.
Access creates temporary objects when sorting data and/or executing queries (and those temporary objects are created and stored in the file). Depending on the complexity of your queries, those temporary objects might be pushing the file size up (see previous paragraph).
If you need text fields longer than 255 characters, consider using Memo fields (these fields cannot be indexed as far as I remember, so be careful when using them).
Consider adding more indexes to your tables to ease and speed up the queries.
Consider dividing the database: Put all the data in one (or more) file(s) and link the tables in it (them) to another Access file, and execute the queries in this last one.
I have a problem where I need to have the dimension key for all rows in the data stream.
1. I use a lookup component to find the dimension key of the records.
2. Records with no dimension key (lookup no match output) are redirected to a different output because they need to be inserted.
3. The no match output is multicast.
4. The new records are inserted into the dimension.
5. A second lookup component should be executed after the records are inserted.
Number 5 fails because I don't know how to wait for the ADO NET Destination to finish...
Is there any way to solve this other than dump the stream into raw files and use other data flow to resume the task?
I think I understand what you're doing now. Normally you would load the dimension fully first in its own data flow, then after this is fully complete, you load the fact, using the already populated dimension. You appear to be trying to load the fact and dimension in one data flow.
The only reason you'd do this in one data flow is if you couldn't separate your distinct dimensions from your facts and your fact source is so huge you don't want to go through it twice. Normally you can preload a dimension without a large fact source, but this is not always the case. It depends on your source data.
You could use a SEQUENCE (http://technet.microsoft.com/en-us/library/ff878091.aspx) to do this in one data flow. This is a way of autogenerating a number without needing to insert it into a table, but your dimension would need to rely on a sequence instead of an identity. But you'd need to call it in some kind of inline script component or you might be able to trick a lookup component. It would be very slow.
Instead you should try building all of your dimensions in a prior load so that when you arrive at the fact load, all dimensions are already there.
In short the question is: Do you really need to do this in one big step? Can you prebuild your dimension in a prior data flow?
Thomas Kejser wrote a blogpost with a nice solution to this early arriving fact / late arriving dimension problem.
http://blogs.msdn.com/b/sqlcat/archive/2009/05/13/assigning-surrogate-keys-to-early-arriving-facts-using-integration-services.aspx
Basically you use a second lookup with a partial cache. Whenever the partial-cache lookup receives a non-matched row, it calls a SQL statement and fetches data to populate the lookup cache. If you use a stored proc in this SQL statement, you can first add the row to the dimension table and then use the SELECT statement to populate the cache.
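The behaviour of such a lookup can be sketched outside SSIS. The following is only an illustration of the get-or-create idea, not the code from the blog post: a Python dict plays the role of the partial cache, sqlite3 stands in for the dimension database, and the names dim_customer, business_key and get_or_create_key are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    "customer_key INTEGER PRIMARY KEY AUTOINCREMENT, "
    "business_key TEXT UNIQUE NOT NULL)"
)
# One dimension member already exists before the fact load starts.
conn.execute("INSERT INTO dim_customer (business_key) VALUES ('A')")

cache = {}  # plays the role of the partial lookup cache


def get_or_create_key(business_key):
    """Return the surrogate key, inserting the dimension row on a miss."""
    if business_key in cache:
        return cache[business_key]
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE business_key = ?",
        (business_key,),
    ).fetchone()
    if row is None:
        # Early arriving fact: create the dimension member first.
        cur = conn.execute(
            "INSERT INTO dim_customer (business_key) VALUES (?)",
            (business_key,),
        )
        key = cur.lastrowid
    else:
        key = row[0]
    cache[business_key] = key
    return key


fact_rows = ["A", "B", "B", "C", "A"]
keys = [get_or_create_key(bk) for bk in fact_rows]
print(keys)
```

Each distinct business key hits the database at most once; repeated keys are served from the cache, which is why this tends to be cheaper than a per-row command.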
I'm reworking an existing PHP/MySQL/JS/Ajax web app that processes a LARGE number of table rows for users. Here's how the page works currently.
A user uploads a LARGE CSV file. The test one I'm working with has 400,000 rows (each row has 5 columns).
PHP creates a brand new table for this data and inserts the hundreds of thousands of rows.
The page then sorts / processes / displays this data back to the user in a useful way. Processing includes searching, sorting by date and other columns, and redisplaying the rows without a huge load time (that's where the JS/Ajax comes in).
My question is: should this app be placing the data into a new table for each upload, or into one large table with an id for each file? I think the original developer was adding separate tables for speed purposes. Speed is very important for this.
Is there a faster way? Is there a better mousetrap? Has anyone ever dealt with this?
Remember every .csv can contain hundreds of thousands of rows, and hundreds of .csv files can be uploaded daily. They can be deleted about 24 hrs after they were last used, though (I'm thinking of a cron job; any opinions?).
Thank you all!
A few notes based on comments:
All data is unique to each user and changes, so the user won't be re-accessing this data after a couple of hours. Only if they accidentally close the window and then come right back would they really revisit the same .csv.
No foreign keys are required; all CSVs are private to each user and don't need to be cross-referenced.
I would shy away from putting all the data into a single table for the simple reason that you cannot change the data structure.
Since the data is being deleted anyway and you don't have a requirement to combine data from different loads, there isn't an obvious reason for putting the data into a single table. The other argument is that the application now works. Do you really want to discover some requirement down the road that implies separate tables after you've done the work?
If you do decide on a single table, then use table partitioning. Since each user is using their own data, you can use partitions to separate each user load into a separate partition. Although there are limits on partitions (such as no foreign keys), this will make accessing the data in a single table as fast as accessing the original data.
Given 10^5 rows and 10^2 CSVs per day, you're looking at 10 million rows per day (and you say you'll clear that data down regularly). That doesn't look like a scary figure for a decent db (especially given that you can index within tables, and not across multiple tables).
Obviously the most regularly used CSVs could very easily be held in memory for speed of access, perhaps even all of them (a very simple calculation based on next to no data gives me a figure of 1 GB if you flush every 24 hours, and 1 GB is not an unreasonable amount of memory these days).
I have a fact table that needs a join to a dimension table however obtaining that relationship from the source data isn't easy. The fact table is loaded from a source table that has around a million rows, so in accordance with best practice, I'm using a previous run date to only select the source rows that have been added since the previous run. After getting the rows I wish to load I need to go through 3 other tables in order to be able to do the lookup to the dimension table. Each of the 3 tables also has around a million rows.
I've read that best practice says not to extract source data that you know won't be needed. And best practice also says to have as light a touch as possible on the source system and therefore avoid SQL joins. But in my case, those two best practices become mutually exclusive. If I only extract changed rows in the intermediary tables, then I'll need to do a join in the source query. If I extract all the rows from the source system, then I'm extracting much more data than I need, and that may cause SSIS memory/performance problems.
I'm leaning towards a join in the extraction of the source data but I've been unable to find any discussions on the merits and drawbacks of that approach. Would that be correct or incorrect? (The source tables and the DW tables are in Oracle).
Can you stage the 3 source tables that you are referencing? You may not need them in the DW, but you could have them sitting in a staging database purely for this purpose. You would still need to keep these up-to-date however, but assuming you can just pull over the changes, this may not be too bad.
Here's our mission:
Receive files from clients. Each file contains anywhere from 1 to 1,000,000 records.
Records are loaded to a staging area and business-rule validation is applied.
Valid records are then pumped into an OLTP database in a batch fashion, with the following rules:
If record does not exist (we have a key, so this isn't an issue), create it.
If record exists, optionally update each database field. The decision is made based on one of 3 factors...I don't believe it's important what those factors are.
Our main problem is finding an efficient method of optionally updating the data at a field level. This is applicable across ~12 different database tables, with anywhere from 10 to 150 fields in each table (original DB design leaves much to be desired, but it is what it is).
Our first attempt has been to introduce a table that mirrors the staging environment (1 field in staging for each system field) and contains a masking flag. The value of the masking flag represents the 3 factors.
We've then put an UPDATE similar to...
UPDATE OLTPTable1 SET Field1 = CASE
WHEN Mask.Field1 = 0 THEN Staging.Field1
WHEN Mask.Field1 = 1 THEN COALESCE( Staging.Field1 , OLTPTable1.Field1 )
WHEN Mask.Field1 = 2 THEN COALESCE( OLTPTable1.Field1 , Staging.Field1 )
...
As you can imagine, the performance is rather horrendous.
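For readers who want to see what the three mask values do to a field, here is a small runnable sketch of the same CASE logic. It uses Python's sqlite3 with correlated subqueries instead of our real SQL Server schema, so the tables (oltp, staging, mask), the single field1 column and the sample values are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE oltp    (id INTEGER PRIMARY KEY, field1 TEXT);
CREATE TABLE staging (id INTEGER PRIMARY KEY, field1 TEXT);
CREATE TABLE mask    (id INTEGER PRIMARY KEY, field1 INTEGER);
INSERT INTO oltp    VALUES (1, 'old'), (2, 'old'), (3, 'old'), (4, NULL);
INSERT INTO staging VALUES (1, NULL),  (2, NULL),  (3, 'new'), (4, 'new');
INSERT INTO mask    VALUES (1, 0),     (2, 1),     (3, 2),     (4, 2);
""")

# Mask 0: always take the staging value (even NULL).
# Mask 1: staging wins unless it is NULL.
# Mask 2: the existing value wins unless it is NULL.
conn.execute("""
UPDATE oltp
SET field1 = CASE (SELECT m.field1 FROM mask AS m WHERE m.id = oltp.id)
    WHEN 0 THEN (SELECT s.field1 FROM staging AS s WHERE s.id = oltp.id)
    WHEN 1 THEN COALESCE(
        (SELECT s.field1 FROM staging AS s WHERE s.id = oltp.id), oltp.field1)
    WHEN 2 THEN COALESCE(
        oltp.field1, (SELECT s.field1 FROM staging AS s WHERE s.id = oltp.id))
    ELSE oltp.field1
END
WHERE id IN (SELECT id FROM staging)
""")
conn.commit()

result = conn.execute("SELECT id, field1 FROM oltp ORDER BY id").fetchall()
print(result)
```

Row 1 is overwritten with NULL, rows 2 and 3 keep 'old', and row 4 picks up 'new'. Multiply that CASE by up to 150 fields and ~12 tables and you get an idea of why the statement is so heavy.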
Has anyone tackled a similar requirement?
We're a MS shop using a Windows Service to launch SSIS packages that handle the data processing. Unfortunately, we're pretty much novices at this stuff.
If you are using SQL Server 2008, look into the MERGE statement, this may be suitable for your Upsert needs here.
Can you use a Conditional Split for the input to send the rows to a different processing stage dependent upon the factor that is matched? Sounds like you may need to do this for each of the 12 tables but potentially you could do some of these in parallel.
I took a look at the merge tool, but I’m not sure it would allow for the flexibility to indicate which data source takes precedence based off of a predefined set of rules.
This function is critical to allow for a system that lets multiple members utilize the process that can have very different needs.
From what I have read the Merge function is more of a sorted union.
We do use an approach similar to what you describe in our product for external system inputs. (we handle a couple of hundred target tables with up to 240 columns) Like you describe, there's anywhere from 1 to a million or more rows.
Generally, we don't try to set up a single mass update, we try to handle one column's values at a time. Given that they're all a single type representing the same data element, the staging UPDATE statements are simple. We generally create scratch tables for mapping values and it's a simple
UPDATE target SET target.column = mapping.resultcolumn WHERE target.sourcecolumn = mapping.sourcecolumn.
Setting up the mappings is a little involved, but we again deal with one column at a time while doing that.
I don't know how you define 'horrendous'. For us, this process is done in batch mode, generally overnight, so absolute performance is almost never an issue.
EDIT:
We also do these in configurable-size batches, so the working sets & COMMITs are never huge. Our default is 1000 rows in a batch, but some specific situations have benefited from up to 40 000 row batches. We also add indexes to the working data for specific tables.
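As a rough sketch of what a single-column mapping update in batches looks like (not our actual product code), here is a small Python/sqlite3 version. The target and mapping tables, their columns and the apply_mapping_in_batches helper are hypothetical, and the batch boundaries are simply ranges of the integer id.

```python
import sqlite3


def apply_mapping_in_batches(conn, batch_size=1000):
    """Apply the mapping to one column, committing after each batch of ids."""
    ids = [row[0] for row in conn.execute("SELECT id FROM target ORDER BY id")]
    batches = 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        lo, hi = batch[0], batch[-1]
        conn.execute(
            "UPDATE target SET result_value = ("
            "  SELECT m.result_value FROM mapping AS m"
            "  WHERE m.source_value = target.source_value) "
            "WHERE id BETWEEN ? AND ? "
            "AND source_value IN (SELECT source_value FROM mapping)",
            (lo, hi),
        )
        conn.commit()  # keep each working set and transaction small
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (id INTEGER PRIMARY KEY, source_value TEXT, result_value TEXT);
CREATE TABLE mapping (source_value TEXT PRIMARY KEY, result_value TEXT);
INSERT INTO mapping VALUES ('a', 'ALPHA'), ('b', 'BETA');
""")
conn.executemany(
    "INSERT INTO target (source_value) VALUES (?)",
    [("a",), ("b",), ("c",)] * 1000,
)
conn.commit()

batch_count = apply_mapping_in_batches(conn, batch_size=1000)
print(batch_count)
```

Rows whose source value has no mapping ('c' here) are left untouched, and the batch size is just a parameter, so it can be tuned per table as described above.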