SSIS - Bulk Update at Database Field Level

Here's our mission:
Receive files from clients. Each file contains anywhere from 1 to 1,000,000 records.
Records are loaded to a staging area and business-rule validation is applied.
Valid records are then pumped into an OLTP database in a batch fashion, with the following rules:
If record does not exist (we have a key, so this isn't an issue), create it.
If record exists, optionally update each database field. The decision is made based on one of 3 factors...I don't believe it's important what those factors are.
Our main problem is finding an efficient method of optionally updating the data at a field level. This is applicable across ~12 different database tables, with anywhere from 10 to 150 fields in each table (original DB design leaves much to be desired, but it is what it is).
Our first attempt has been to introduce a table that mirrors the staging environment (1 field in staging for each system field) and contains a masking flag. The value of the masking flag represents the 3 factors.
We've then put an UPDATE similar to...
UPDATE OLTPTable1 SET Field1 = CASE
    WHEN Mask.Field1 = 0 THEN Staging.Field1
    WHEN Mask.Field1 = 1 THEN COALESCE( Staging.Field1 , OLTPTable1.Field1 )
    WHEN Mask.Field1 = 2 THEN COALESCE( OLTPTable1.Field1 , Staging.Field1 )
END,
...
As you can imagine, the performance is rather horrendous.
Has anyone tackled a similar requirement?
We're a MS shop using a Windows Service to launch SSIS packages that handle the data processing. Unfortunately, we're pretty much novices at this stuff.

If you are using SQL Server 2008, look into the MERGE statement; it may be suitable for your upsert needs here.
Can you use a Conditional Split on the input to send the rows to a different processing stage depending on which factor is matched? It sounds like you may need to do this for each of the 12 tables, but you could potentially run some of them in parallel.
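For illustration, a rough sketch of how MERGE could combine the insert and the masked update in one pass, reusing the table and column names from your question (the KeyField join column is an assumption, not something you have described):

MERGE OLTPTable1 AS target
USING ( SELECT s.KeyField, s.Field1, m.Field1 AS MaskFlag1
        FROM Staging s
        JOIN Mask m ON m.KeyField = s.KeyField ) AS src
    ON target.KeyField = src.KeyField
WHEN MATCHED THEN
    UPDATE SET Field1 = CASE
        WHEN src.MaskFlag1 = 0 THEN src.Field1
        WHEN src.MaskFlag1 = 1 THEN COALESCE( src.Field1 , target.Field1 )
        WHEN src.MaskFlag1 = 2 THEN COALESCE( target.Field1 , src.Field1 )
    END
    -- ...repeat the CASE for each remaining field
WHEN NOT MATCHED THEN
    INSERT ( KeyField, Field1 )
    VALUES ( src.KeyField, src.Field1 );

The per-field CASE logic stays the same as in your UPDATE, but the insert and update happen in a single statement and a single scan of the target.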

I took a look at the merge tool, but I'm not sure it would allow the flexibility to indicate which data source takes precedence based on a predefined set of rules.
That flexibility is critical: the system has to let multiple members, whose needs can be very different, use the same process.
From what I have read, the Merge function is more of a sorted union.

We use an approach similar to what you describe in our product for external system inputs (we handle a couple of hundred target tables with up to 240 columns). Like you describe, there's anywhere from 1 to a million or more rows.
Generally, we don't try to set up a single mass update; we handle one column's values at a time. Given that they're all a single type representing the same data element, the staging UPDATE statements are simple. We generally create scratch tables for mapping values, and it's a simple
UPDATE target
SET    target.[column] = mapping.resultcolumn
FROM   target
JOIN   mapping ON mapping.sourcecolumn = target.sourcecolumn
Setting up the mappings is a little involved, but again we deal with one column at a time while doing that.
I don't know how you define 'horrendous'. For us, this process runs in batch mode, generally overnight, so absolute performance is almost never an issue.
EDIT:
We also do these in configurable-size batches, so the working sets & COMMITs are never huge. Our default is 1000 rows in a batch, but some specific situations have benefited from up to 40 000 row batches. We also add indexes to the working data for specific tables.
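A rough sketch of one way to do configurable-size batches in T-SQL; the Processed flag, the KeyField column, and the table names are assumptions for illustration, not the poster's actual schema:

DECLARE @BatchSize int = 1000;                     -- the default above; some tables do better at up to 40,000
DECLARE @keys TABLE ( KeyField int PRIMARY KEY );

WHILE 1 = 1
BEGIN
    DELETE FROM @keys;

    -- claim the next batch of unprocessed staging rows and remember their keys
    UPDATE TOP ( @BatchSize ) s
    SET    s.Processed = 1                         -- hypothetical "done" flag on the staging table
    OUTPUT inserted.KeyField INTO @keys ( KeyField )
    FROM   Staging s
    WHERE  s.Processed = 0;

    IF @@ROWCOUNT = 0 BREAK;                       -- nothing left to process

    -- apply just that batch to the target, one column at a time as described above
    UPDATE t
    SET    t.Field1 = s.Field1
    FROM   OLTPTable1 t
    JOIN   Staging    s ON s.KeyField = t.KeyField
    JOIN   @keys      k ON k.KeyField = s.KeyField;
END;

Each pass touches at most @BatchSize rows, so the working set and each commit stay small.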

Related

How to handle 20M+ records from tables with same structure in MySQL

I have to handle 25M rows of data that I have collected and transformed from about 50 different sources. Every source yields about 500,000 to 600,000 rows. Each record has the same structure regardless of the source (let's say: id, title, author, release_date).
For flexibility, I would prefer to create a dedicated table for each source, so that I can clear/drop the data from one source and reload it very quickly (using LOAD DATA INFILE). This way it seems very easy to truncate a table with no risk of deleting rows from other sources.
But then I don't know how to select records having the same author across the different tables, and, as the cherry on the cake, with pagination (the LIMIT keyword).
Is the only solution to store everything in a single huge table and deal with the pain of indexing/backing up a 25M+ row database, or is there some kind of abstraction layer to virtually merge the 50 tables into one?
It is probably a usual question for a DBA, but I could not find any answer yet...
Any help/idea much appreciated. Thanks.
This might be a good spot for MySQL partitioning.
It lets you handle a big volume of data while giving you the ability to run DML operations on a specific partition when needed (such as TRUNCATE or even DROP) very efficiently and without impacting the rest of your data. Partition selection is also supported in LOAD DATA statements.
You can run queries across partitions as you would with a normal table, or target a specific partition when you need to (which can be done very efficiently).
In your specific use case, list partitioning seems like a relevant choice: you have a pre-defined list of sources, so you would typically have one partition per source.
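A rough sketch of what that could look like, assuming MySQL 5.6+ and an extra source_id column; all names here are illustrative:

CREATE TABLE books (
    id           BIGINT NOT NULL,
    source_id    INT NOT NULL,
    title        VARCHAR(255),
    author       VARCHAR(255),
    release_date DATE,
    PRIMARY KEY (id, source_id),      -- the partitioning column must be part of every unique key
    KEY idx_author (author)
)
PARTITION BY LIST (source_id) (
    PARTITION p_src01 VALUES IN (1),
    PARTITION p_src02 VALUES IN (2)
    -- ... one partition per source, up to 50
);

-- Reload a single source quickly, without touching the other 49
ALTER TABLE books TRUNCATE PARTITION p_src01;
LOAD DATA INFILE '/tmp/source01.csv'
    INTO TABLE books PARTITION (p_src01)
    FIELDS TERMINATED BY ',';

-- Query across all sources as one table, with pagination
SELECT id, title, author, release_date
FROM   books
WHERE  author = 'Jorge Luis Borges'
ORDER BY release_date
LIMIT  50 OFFSET 0;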

update target table given DateCreated and DateUpdated columns in source table

What is the most efficient way of updating a target table, given that the source table contains DateTimeCreated and DateTimeUpdated columns?
I would like to keep the source and target in sync while avoiding a
truncate. I am looking for a best-practice pattern for this situation.
I'll avoid a "best practice" answer but give enough detail for you to make an appropriate choice. There are two main methods with which you might update a table in SSIS while avoiding a TRUNCATE - LOAD:
1) Use an OLE DB Command
This method is good if:
you have a reliable DateTimeUpdated column,
there are not many rows to update,
there are not a lot of columns to update
there are not many added columns in the dataflow (i.e. derived column transforms)
and the update statement is fairly straightforward.
This method performs poorly with many columns because it performs a row-by-row update. Relying on an audit date column can be a great way to reduce the number of rows to update, but it can also cause problems if rows are updated in the source system and the audit column is not changed. I recommend only trusting it if it is maintained by a trigger or you can be certain that no human can perform updates on the table.
Additionally, this component falls short when there are a lot of columns to map or a lot of transforms going on in the data flow. For example, if you are converting all string columns from Unicode to non-Unicode, you may have many additional columns in the mix that will make mapping and maintenance a pain. The mapping tool in this component is good for about 10 columns; it starts to get confusing very quickly after that, especially because you are mapping to numbered parameters rather than column names.
Lastly, if you are doing anything complex in the update statement, it is better suited to SQL code than to the component's editor, which has no IntelliSense and is generally painful to use.
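For reference, the statement you paste into the OLE DB Command editor looks something like the sketch below; each ? is mapped by position (Param_0, Param_1, ...) to a data flow column, which is where the maintenance pain comes from. Table and column names here are made up:

UPDATE dbo.TargetTable
SET    Col1 = ?,
       Col2 = ?,
       DateTimeUpdated = ?
WHERE  BusinessKey = ?;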
2) Stage the data and perform the update in Execute SQL task after the data flow
This method is good for all the reasons that the OLE DB Command is bad for, but it has some disadvantages as well. There is more code to maintain:
a couple of t-sql tasks,
a proc
and a staging table
This also means it takes more time to set up. However, it performs very well, the code is far easier to read and understand, and ongoing maintenance is simpler as well.
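As a rough sketch of option 2, the Execute SQL Task (or the proc it calls) could run something like the following after the data flow has loaded the staging table; all names are illustrative, and the WHERE clause assumes the DateTimeUpdated column can be trusted:

-- update rows that already exist and have changed since the last load
UPDATE t
SET    t.Col1            = s.Col1,
       t.Col2            = s.Col2,
       t.DateTimeUpdated = s.DateTimeUpdated
FROM   dbo.TargetTable   t
JOIN   stg.SourceTable   s ON s.BusinessKey = t.BusinessKey
WHERE  s.DateTimeUpdated > t.DateTimeUpdated;

-- insert rows that do not yet exist in the target
INSERT INTO dbo.TargetTable ( BusinessKey, Col1, Col2, DateTimeCreated, DateTimeUpdated )
SELECT s.BusinessKey, s.Col1, s.Col2, s.DateTimeCreated, s.DateTimeUpdated
FROM   stg.SourceTable s
WHERE  NOT EXISTS ( SELECT 1 FROM dbo.TargetTable t WHERE t.BusinessKey = s.BusinessKey );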
Please see my notes from this other question that I happened to answer today on the same subject: SSIS Compare tables content and update another

Loading a fact table in SSIS when obtaining the dimension key isn't easy

I have a fact table that needs a join to a dimension table however obtaining that relationship from the source data isn't easy. The fact table is loaded from a source table that has around a million rows, so in accordance with best practice, I'm using a previous run date to only select the source rows that have been added since the previous run. After getting the rows I wish to load I need to go through 3 other tables in order to be able to do the lookup to the dimension table. Each of the 3 tables also has around a million rows.
I've read that best practice says not to extract source data that you know won't be needed. Best practice also says to have as light a touch as possible on the source system and therefore to avoid SQL joins there. But in my case those two best practices become mutually exclusive: if I only extract changed rows from the intermediary tables then I'll need a join in the source query, and if I extract all the rows from the source system then I'm extracting much more data than I need, which may cause SSIS memory/performance problems.
I'm leaning towards a join in the extraction of the source data but I've been unable to find any discussions on the merits and drawbacks of that approach. Would that be correct or incorrect? (The source tables and the DW tables are in Oracle).
Can you stage the 3 source tables that you are referencing? You may not need them in the DW, but you could have them sitting in a staging database purely for this purpose. You would still need to keep them up to date, but assuming you can just pull over the changes, this may not be too bad.
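As a rough sketch of that idea, once the three intermediary tables are staged locally (and refreshed incrementally), resolving the dimension key becomes an ordinary join in your own database instead of a join in the Oracle source query; every name below is a placeholder:

SELECT f.SourceKey,
       d.DimensionKey
FROM   stg.FactSource    f
JOIN   stg.Intermediate1 i1 ON i1.Key1 = f.Key1
JOIN   stg.Intermediate2 i2 ON i2.Key2 = i1.Key2
JOIN   stg.Intermediate3 i3 ON i3.Key3 = i2.Key3
JOIN   dw.DimTable       d  ON d.NaturalKey = i3.NaturalKey;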

MySQL structure for DBs larger than 10mm records

I am working with an application which has 3 tables, each with more than 10mm records and larger than 2GB.
Every time data is inserted, at least one record is added to each of the three tables, and possibly more.
After every INSERT, a script is launched which queries all these tables in order to extract the data relevant to the last INSERT (let's call this the aggregation script).
What is the best way to divide the DB in smaller units and across different servers so that the load for each server is manageable?
Notes:
1. There are in excess of 10 inserts per second and hence the aggregation script is run the same number of times.
2. The aggregation script is resource intensive
3. The aggregation script has to be run on all the data in order to find which one is relevant to the last insert
4. I have not found a way of somehow dividing the DB into smaller units
5. I know very little about distributed DBs, so please use very basic terminology and provide links for further reading if possible
There are two answers to this from a database point of view.
Find a way of breaking up the database into smaller units. This is very dependent on the use of your database. This is really your best bet because it's the only way to get the database to look at less stuff at once. This is called sharding:
http://en.wikipedia.org/wiki/Shard_(database_architecture)
Have multiple "slave" databases in read only mode. These are basically copies of your database (with a little lag). For any read only queries where that lag is acceptable, they access these databases across the code in your entire site. This will take some load off of the master database you are querying. But, it will still be resource intensive on any particular query.
From a programming perspective, you already have nearly all your information (aside from ids). You could try to find some way of using that information for all your needs rather than having to requery the database after insert. You could have some process that only creates ids that you query first. Imagine you have tables A, B, C. You would have other tables that only have primary keys that are A_ids, B_ids, C_ids. Step one, get new ids from the id tables. Step two, insert into A, B, C and do whatever else you want to do at the same time.
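A rough sketch of that id-table idea in MySQL; the table and column names are hypothetical:

-- one narrow id-generator table per big table
CREATE TABLE A_ids (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);

-- step one: reserve an id without touching the big table
INSERT INTO A_ids VALUES (NULL);
SET @a_id = LAST_INSERT_ID();

-- step two: insert into A and kick off the aggregation work using @a_id directly,
-- instead of re-querying A to find out what was just inserted
INSERT INTO A (id, payload) VALUES (@a_id, 'example payload');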
Also, the general efficiency/performance of all queries should be reviewed. Make sure you have indexes on anything you are querying. Run EXPLAIN on all queries to make sure they are using indexes.
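For example, assuming the aggregation script filters on an account_id column (account_id and amount are made-up names):

-- should report an access type of ref/range and the chosen index, not ALL (a full scan)
EXPLAIN SELECT SUM(amount) FROM A WHERE account_id = 42;

-- add the index if it is missing
ALTER TABLE A ADD INDEX idx_account_id (account_id);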
This is really a mid-level/senior DBA type of task. Ask around your company and have someone lend you a hand and teach you.

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems; I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id; bonus points for inserting your keys in sorted order (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result. A sketch of this table is below.
If your search items have an incrementing primary id such that any new insertion to the database has a higher value than anything already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always applying that maximum id to the saved query.
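A rough sketch of the first option in MySQL, following the earlier point about storing the SHA-1 as its raw 20 bytes (here BINARY(20)) rather than hex text; all names are illustrative, and it assumes the ids really are standard 20-byte SHA-1 values:

CREATE TABLE saved_search_items (
    search_id BIGINT UNSIGNED NOT NULL,
    item_id   BINARY(20)      NOT NULL,   -- raw SHA-1 bytes, not the hex string
    PRIMARY KEY (search_id, item_id)
);

-- save one item of a dataset: UNHEX() converts the hex id at the boundary
INSERT INTO saved_search_items (search_id, item_id)
VALUES (42, UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709'));

-- recover the dataset for further Solr queries, converting back to hex
SELECT HEX(item_id) FROM saved_search_items WHERE search_id = 42;

-- expiring a dataset is a single range delete within one clustered key
DELETE FROM saved_search_items WHERE search_id = 42;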