SSIS Warm Caches - ssis

Please see the following webpage: http://msdn.microsoft.com/en-us/library/ms137786.aspx. In the 'Caches of Indexes and Reference Tables' section it states:
"When you configure the Fuzzy Lookup transformation, you can specify
whether the transformation partially caches the index and reference
table in memory before the transformation does its work. If you set
the WarmCaches property to True, the index and reference table are
loaded into memory. When the input has many rows, setting the
WarmCaches property to True can improve the performance of the
transformation. When the number of input rows is small, setting the
WarmCaches property to False can make the reuse of a large index
faster."
Where do you set Warm Caches? I have tried Googling this. I have also looked in the Properties of the component.
Does 'Warm Caches' mean the component will use Fuzzy Groups that were created on a previous run of the SSIS package?

The answer is in the following passage from the same BOL page:
At run time, the Fuzzy Lookup transformation creates temporary
objects, such as tables and indexes, in the SQL Server database that
the transformation connects to. The size of these temporary tables and
indexes is proportionate to the number of rows and tokens in the
reference table and the number of tokens that the Fuzzy Lookup
transformation creates; therefore, they could potentially consume a
significant amount of disk space.
This implies that if we select warm caches, the temporary tables that are created are loaded into the SSIS memory space, similar to a fully cached Lookup transformation.
The WarmCaches setting can be found in the Properties tab in BIDS (screenshot: "Pretty Sneaky SSIS").

Related

What is the "Buffer Size" & "Max Rows" Relationship to Multiple Source and Destination flows "Rows Per Batch" and "Insert Commit Size"?

I have a Data Flow Task that moves a bunch of data from multiple sources to multiple destinations. About 50 in all. The data moved is from one database to another with varying rows and columns in each flow.
While I believe I understand the basic idea behind the Data Flow Task's DefaultBufferMaxRows and DefaultBufferSize as it relates to Rows per Batch and Maximum insert commit size of the Destination, it's not clear to me what happens when there are multiple unrelated source and destination flows.
What I'm wondering is which of the following makes the most sense:
1. Divide out all the source and destination flows into separate Data Flow Tasks
2. Divide them into groups that have roughly the same size and number of rows
3. Leave as is and just make sure to set the properties with enough Buffer Rows and Buffer Size while setting the Rows per batch and Maximum insert commit size on each individual destination
I believe I read somewhere that it's better to have each source and destination in its own Data Flow Task, but I am unable to find the link at this time.
Most examples I've been able to locate online seem to always be for one source to one or more destinations, or just one to one.
Let me start from the basics. A Data Flow Task is a task that organizes a pipeline of data from a Data Source to a Data Destination. It is unique among SSIS tasks because it manipulates data inside SSIS itself; all other tasks call external systems to do something with data outside SSIS.
On the relationship between DefaultBufferMaxRows and DefaultBufferSize on one side and Rows per Batch and Maximum insert commit size of the Destination on the other: there is no direct relation. DefaultBufferMaxRows and DefaultBufferSize are properties of the Data Flow pipeline; the pipeline processes rows in batches and these properties control the processing batch size. They determine the RAM consumption and performance of the Data Flow Task.
On the other hand, Rows per Batch and Maximum insert commit size are properties of the Data Destination, namely the OLE DB Destination in Fast Load mode only; they control the performance of the Data Destination itself. You may have a Data Flow with a Flat File Destination, which has no Rows per Batch property, but the Data Flow will still have DefaultBufferMaxRows and DefaultBufferSize.
Typical usage from my experience:
DefaultBufferMaxRows and DefaultBufferSize control the batch size of the Data Flow pipeline. Tuning them is a tradeoff - bigger batches mean less overhead on batch handling, i.e. less execution time, but more RAM consumption. More RAM consumption means that you might run out of RAM, in which case the DFT data buffers will be swapped to disk.
In SSIS 2016+ there is a "magical setting", AutoAdjustBufferSize, which tells the engine to autogrow the buffer.
Values for these properties are usually determined during performance tests in a QA environment. In development, use the defaults.
Rows per Batch and Maximum insert commit size control log growth and the possibility of rolling back all changes. Do not change these unless you really need to. The defaults are generally OK; I have changed them rarely, and only for a specific reason. More on its functions.
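To make the buffer tradeoff concrete, here is a rough Python sketch of how the two pipeline properties interact. It assumes the documented defaults (DefaultBufferMaxRows = 10,000 and DefaultBufferSize = 10 MB) and a simplified rule that a buffer is capped by whichever limit is reached first; the real engine applies further adjustments, and the function name and row widths are made up for illustration.

```python
# Simplified model, not the engine's actual algorithm.
DEFAULT_BUFFER_MAX_ROWS = 10_000           # default DefaultBufferMaxRows
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024     # default DefaultBufferSize (10 MB)


def rows_per_buffer(estimated_row_bytes,
                    max_rows=DEFAULT_BUFFER_MAX_ROWS,
                    buffer_size=DEFAULT_BUFFER_SIZE):
    """Estimate how many rows fit in one pipeline buffer (hypothetical helper)."""
    if estimated_row_bytes <= 0:
        raise ValueError("estimated_row_bytes must be positive")
    # The row cap and the byte cap both apply; the smaller one wins.
    return max(1, min(max_rows, buffer_size // estimated_row_bytes))


# Hypothetical row widths: a narrow 200-byte row and a wide 5,000-byte row.
narrow_rows = rows_per_buffer(200)     # limited by the row cap
wide_rows = rows_per_buffer(5_000)     # limited by the byte cap
print(narrow_rows, wide_rows)
```

Under this model a narrow row leaves most of the buffer unused until DefaultBufferMaxRows is raised, while a wide row is limited by DefaultBufferSize. That is one reason flows with very different row widths are easier to tune in separate Data Flow Tasks.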
On package design:
1. One Source-Destination pair per DFT (Data Flow Task). This is optimal - it gives you the most control in terms of tuning, execution order, etc. You can also take advantage of parallel execution of tasks by the SSIS engine. BTW, it simplifies debugging and support.
2. Division into groups. You can group DFTs in Sequence containers and define common properties via Expressions and Variables. But use this only if you really need to, because it complicates your design.
3. All Source-Destination pairs in one DFT. I would recommend against it; it is complex and error prone.
As a bottom line, keep it simple -- one Source-Destination pair per DFT, and play with your parameters only if you have to.

How to dynamically set Cache mode in Lookup

The first time our SSIS packages run they do a full load, and incremental loads afterwards. All Lookups are using Full cache, but it may not be handy for incremental loads, as some of the Lookup tables contain millions of records, and the incremental load may be small.
Is it possible to dynamically set, based on some parameter, whether a Lookup should use Full Cache, Partial Cache or No Cache?
Solution
Because the database and SSIS packages are on the same server, partial cache with indexes on the lookup columns is as fast as full cache for full loads, and even faster for incremental loads.
I can't see a way to do this, but potential alternatives would be:
Build two different data flow tasks, one for each cache type, and then use a variable/logic to decide which one to run by setting expressions on your precedence constraints.
Use an SQL statement from a variable as your Lookup connection, and select only the records you need, e.g., if it's based on a date range or something like that. You could then build the SQL statement before each execution.
Here is a trick I am using. Create 2 lookups. The first is a lookup with full cache and redirect on no match. Its no-match branch goes into the second lookup, which is exactly the same but in partial cache mode. The no-match policy on this one is what you actually need. Match results from both lookups are combined back together in a Union All.
Now the tricky part. Go out of the data flow to the control flow that contains it. When you select the Data Flow Task there, you can find Expressions in its properties. Among those expressions you will find an entry that looks like [].[SqlCommand]. There you should build an expression for your lookup query that adds an always-false condition to it when you want to use the partial cache lookup.
Here is a simple example. Let's say you have an IsFullLookup boolean variable that you set to true for a large initial load, and to false for small incremental loads. And let's say your lookup query looks like this:
SELECT Value, Key FROM MyLookupTable
Then the expression may look something like this
"SELECT Value, Key FROM MyLookupTable" + ( @[User::IsFullLookup] ? "" : " WHERE 0=1" )
How this works: if your IsFullLookup variable is true, the query remains unchanged. It performs a full load into the cache, and all matching values go straight to the Union All. Non-matching values go to the partial cache lookup. That's the downside of this method, so it might not be the best choice if you have too many non-matching values on full loads.
If IsFullLookup is false, the full cache query returns 0 rows very quickly. Every row then fails to match in the first lookup, also very quickly, and all rows are redirected to the partial cache lookup, which does its job the regular way. In this case you get a pure partial cache lookup with minimal overhead.
This technique makes your packages uglier than before. But if you work with large lookup tables it is totally worth it. In your everyday load you are not spending extra time and memory on large lookups. And in the rare case when you need a full (or just a large) load, your partial cache does not become a disaster.
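The row routing described above can be modelled outside SSIS. The Python sketch below is only an analogy, not SSIS code. It uses an in-memory SQLite table instead of SQL Server, a dict for the full cache, and functools.lru_cache in place of the partial cache. The names build_lookup, LookupKey and LookupValue are hypothetical.

```python
import sqlite3
from functools import lru_cache

BASE_QUERY = "SELECT LookupKey, LookupValue FROM MyLookupTable"


def build_lookup(conn, is_full_lookup):
    """Return (lookup function, full cache) mimicking the two chained lookups."""
    # Same idea as the SSIS expression: append an always-false filter
    # so the "full cache" loads zero rows on incremental runs.
    query = BASE_QUERY + ("" if is_full_lookup else " WHERE 0=1")
    full_cache = dict(conn.execute(query).fetchall())

    @lru_cache(maxsize=1024)  # stand-in for the partial cache
    def partial_lookup(key):
        row = conn.execute(
            "SELECT LookupValue FROM MyLookupTable WHERE LookupKey = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def lookup(key):
        # First lookup: full cache; a miss is redirected to the second lookup.
        if key in full_cache:
            return full_cache[key]
        return partial_lookup(key)

    return lookup, full_cache


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyLookupTable (LookupKey INTEGER PRIMARY KEY, LookupValue TEXT)"
)
conn.executemany("INSERT INTO MyLookupTable VALUES (?, ?)", [(1, "a"), (2, "b")])

full_lookup, full_cache = build_lookup(conn, is_full_lookup=True)
inc_lookup, inc_cache = build_lookup(conn, is_full_lookup=False)
```

With is_full_lookup set to False the full cache stays empty, so every key goes through the per-key query, which corresponds to the "pure partial cache" case described above.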

Is there / would be feasible a service providing random elements from a given SQL table?

ABSTRACT
Talking with some colleagues we came across the "extract random row from a big database table" issue. It's a classic one and we know the naive approach (also on SO) is usually something like:
SELECT * FROM mytable ORDER BY RAND() LIMIT 1
THE PROBLEM
We also know a query like that is utterly inefficient and actually usable only with very few rows. There are some approaches that could be taken to attain better efficiency, like these ones still on SO, but they won't work with arbitrary primary keys and the randomness will be skewed as soon as you have holes in your numeric primary keys. An answer to the last cited question links to this article which has a good explanation and some bright solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again if you have frequent DELETEs on a big table you'll probably be screwed up by the constant updating of the added table. Also note that many solutions rely on COUNT(*) which is ridiculously fast on MyISAM but "just fast" on InnoDB (I don't know how it performs on other platforms but I suspect the InnoDB case could be representative of other transactional database systems).
In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.
THE IDEA
A separate service could be responsible to generate, buffer and distribute random row ids or even entire random rows:
it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in ram by the service (shouldn't take too many bytes per row in addition to the actual size of the PK, it's probably ok up to 100~1000M rows with standard PCs and up to 1~10 billion rows with a beefy server)
once the keys are in memory you have an implicit "row number" for each key and no holes in it so it's just a matter of choosing a random number and directly fetch the corresponding key
a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
consumers of the service will connect and request N random rows from the buffer
rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
if the buffer is empty the request could block or return EOF-like
if data is added to the master table the service must be signaled to add the same data to its copy too, flush the buffer of random picks and go on from that
if data is deleted from the master table the service must be signaled to remove that data too from both the "all keys" list and "random picks" buffer
if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
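A minimal Python sketch of the in-memory part of this idea is shown below. It is an illustration under assumptions, not an existing service: the class and method names are hypothetical, the signaling from the database is left out, and an empty buffer is simply refilled on demand rather than blocking or returning EOF.

```python
import random


class RandomKeyService:
    """Keeps all primary keys in RAM and hands out buffered random picks."""

    def __init__(self, keys, buffer_size=100, rng=None):
        self._keys = list(keys)                    # implicit row number = list index
        self._pos = {k: i for i, k in enumerate(self._keys)}
        self._rng = rng or random.Random()
        self._buffer_size = buffer_size
        self._buffer = []
        self._refill()

    def _refill(self):
        # Pre-compute a batch of uniformly random keys.
        if self._keys:
            self._buffer = [
                self._rng.choice(self._keys) for _ in range(self._buffer_size)
            ]
        else:
            self._buffer = []

    def take(self, n):
        """Return up to n random keys (fewer only if no keys exist)."""
        out = []
        while len(out) < n and self._keys:
            if not self._buffer:
                self._refill()
            out.append(self._buffer.pop())
        return out

    def add(self, key):
        if key in self._pos:
            return
        self._pos[key] = len(self._keys)
        self._keys.append(key)
        self._refill()          # flush the buffer so new keys can be picked

    def remove(self, key):
        i = self._pos.pop(key, None)
        if i is None:
            return
        last = self._keys.pop()
        if i < len(self._keys):  # fill the hole with the last key: O(1), no gaps
            self._keys[i] = last
            self._pos[last] = i
        self._buffer = [k for k in self._buffer if k != key]


service = RandomKeyService(["a", "b", "c"], buffer_size=5, rng=random.Random(0))
picks = service.take(10)
```

Deletion swaps the last key into the freed slot, so the list does not stay ordered; ordering is not needed for uniform picks, only the absence of holes.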
WHY WE THINK IT'S COOL
does not touch disks other than the initial load of keys at startup or when signaled to do so
works with any kind of primary key, numerical or not
if you know you're going to update a large batch of data you can just signal it when you're done (i.e. not at every single insert/update/delete on the original data), it's basically like having a fine grained lock that only blocks requests for random rows
really fast on updates of any kind in the original data
offloads some work from the relational db to another, memory only process: helps scalability
responds really fast from its buffers without waiting for any querying, scanning, sorting
could easily be extended to similar use cases beyond the SQL one
WHY WE THINK IT COULD BE A STUPID IDEA
because we had the idea without help from any third party
because nobody (we heard of) has ever bothered to do something similar
because it adds complexity in the mix to keep it updated whenever original data changes
AND THE QUESTION IS...
Does anything similar already exist? If not, would it be feasible? If not, why?
The biggest risk with your "cache of eligible primary keys" concept is keeping the cache up to date, when the origin data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.
How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger can fire even if the transaction that spawned it is rolled back. This is a general problem with notifying external systems from triggers.
If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can't change the code.
In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don't need an absolutely random selection all the time.
For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it's not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
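The equality-with-retries approach could look roughly like the Python sketch below. It assumes an integer primary key named id on a table named mytable and uses an in-memory SQLite database for the demonstration; the function name and the retry limit are arbitrary choices.

```python
import random
import sqlite3


def random_row_by_equality(conn, max_tries=10, rng=random):
    """Pick a random id in [MIN(id), MAX(id)] and retry when it hits a gap."""
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM mytable").fetchone()
    if lo is None:                      # empty table
        return None
    for _ in range(max_tries):
        candidate = rng.randint(lo, hi)
        row = conn.execute(
            "SELECT * FROM mytable WHERE id = ?", (candidate,)
        ).fetchone()
        if row is not None:             # the candidate id exists
            return row
    return None                         # caller should fall back to another method


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
# Ids with gaps: 3, 4, 6, 7 and 8 are missing.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "one"), (2, "two"), (5, "five"), (9, "nine")],
)
row = random_row_by_equality(conn, max_tries=1000, rng=random.Random(1))
```

Because each existing id is hit with the same probability, gaps only cost retries and do not skew the selection. If a large share of the id range is gaps, the retries add up and another method becomes preferable.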
It appears you are basically addressing a performance issue here. Most DB performance experts recommend you have as much RAM as your DB size, then disk is no longer a bottleneck - your DB lives in RAM and flushes to disk as required.
You're basically proposing a custom developed in-RAM CDC Hashing system.
You could just build this as a standard database only application and lock your mapping table in RAM, if your DB supports this.
I guess I am saying that you can address performance issues without developing custom applications, just use already existing performance tuning methods.

What is the algorithm for query search in the database?

Good day everyone, I'm currently doing research on search algorithm optimization.
As of now, I'm researching databases.
In a database with SQL support, I can write queries against a specific table:
1. Select Number from Table1 where Name = "Test";
2. Select * from Table1 where Name = "Test";
Query 1 returns the Number of the rows in Table1 where the Name is Test, and query 2 returns all the columns of those rows.
I understand what the queries do; what I'm interested in learning is how the search itself is carried out.
Is it just a plain linear search that checks every row from the first to the nth and keeps those where the condition is true, giving O(n) time, or is there a specialized algorithm that speeds up the process?
If there are no indexes, then yes, a linear search is performed.
But databases typically use a B-tree index when you specify one or more columns as a key. These are special data structures that are specifically tuned (high branching factors) to perform well on magnetic disk hardware, where the most time-consuming operation is the seek (the magnetic head has to move to a different part of the file).
You can think of the index as a sorted/structured copy of the values in a column. It can be determined quickly if the value being searched for is in the index. If it finds it, then it will also find a pointer that points back to the correct location of the corresponding row in the main data file(so it can go and read the other columns in the row). Sometimes a multi-column index contains all the data requested by the query, and then it doesn't need to skip back to the main file, it can just read what it found and then its done.
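The "sorted copy with pointers back to the rows" idea can be sketched in a few lines of Python. Note that this uses a sorted list with binary search, not an actual B-tree; it only illustrates why an indexed lookup avoids scanning every row. The table contents and function names are made up.

```python
from bisect import bisect_left

# The "main data file": rows in arbitrary insertion order, as (Name, Number).
rows = [("Carol", 3), ("Alice", 1), ("Test", 42), ("Bob", 2)]

# The "index": sorted (Name, row position) pairs pointing back into rows.
index = sorted((name, pos) for pos, (name, _) in enumerate(rows))


def linear_search(name):
    """No index: examine every row, O(n)."""
    return [row for row in rows if row[0] == name]


def index_search(name):
    """With index: binary search for the name, then follow the pointers."""
    i = bisect_left(index, (name,))    # first index entry with this name, if any
    found = []
    while i < len(index) and index[i][0] == name:
        found.append(rows[index[i][1]])   # jump back to the full row
        i += 1
    return found


indexed_hits = index_search("Test")
scanned_hits = linear_search("Test")
```

Both functions return the same rows; the difference is that the indexed version needs about log2(n) comparisons to find the first match instead of n.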
There's other types of indexes, but I think you get the idea - duplicate data and arrange it in a way that's fast to search.
On a large database, indexes make the difference between waiting a fraction of a second, vs possibly days for a complex query to complete.
BTW, B-trees aren't a simple, easy-to-understand data structure, and the traversal algorithm is also complex. In addition, the traversal code in a real database is even uglier than most code you will find, because the database is constantly loading and unloading chunks of data from disk and managing them in memory. But if you're familiar with binary search trees, then I think you understand the concept well enough.
Well, it depends on how the data is stored and what you are trying to do.
As already indicated, a common structure for maintaining entries is a B+ tree. The tree is well optimized for disk since the actual data is stored only in the leaves, and the keys are stored in the internal nodes. It usually allows a very small number of disk accesses, since the top k levels of the tree can be kept in RAM and only the few bottom levels are stored on disk and require a disk read each.
Another alternative is a hash table. You maintain in memory (RAM) an array of "pointers" - these pointers indicate a disk address, which contains a bucket that includes all entries with the corresponding hash value. Using this method, you only need O(1) disk accesses (which is usually the bottleneck when dealing with databases), so it should be relatively fast.
However, a hash table does not allow efficient range queries (which can be efficiently done in a B+ tree).
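A small Python sketch can illustrate the range-query difference. Here a dict plays the role of the hash table and a sorted list stands in for the ordered leaves of a B+ tree; the keys and function names are invented for the example.

```python
from bisect import bisect_left, bisect_right

ids = [17, 3, 42, 8, 25, 11]

hash_index = {k: f"row-{k}" for k in ids}   # equality lookups in O(1)
sorted_keys = sorted(ids)                   # ordered keys, like B+ tree leaves


def range_via_hash(lo, hi):
    """A hash table has no key order, so every key must be examined."""
    return sorted(k for k in hash_index if lo <= k <= hi)


def range_via_sorted(lo, hi):
    """Ordered keys: find both ends by binary search and slice between them."""
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_right(sorted_keys, hi)]


hash_result = range_via_hash(8, 25)
sorted_result = range_via_sorted(8, 25)
```

The results are identical, but the hash-based version touches all keys, while the ordered version touches only the matching ones plus the two binary searches.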
The disadvantage of all of the above is that each structure is built on a single key - i.e. if the hash table or B+ tree is built on the field "id" of the relation and you then search by the field "key", it becomes useless.
If you want to guarantee fast search on all fields of the relation, you are going to need several structures, each built on a different key, which is not very memory efficient.
Now, there are many optimizations to be considered according to the specific usage. If, for example, the number of searches is expected to be very small (say, fewer than log log N of the total operations), maintaining a B+ tree is overall less efficient than just storing the elements as a list and, on the rare occasion of a search, doing a linear search.
Very good question, but it can have many answers depending on the structure of your table and how it is normalized...
Usually, to perform a search in a SELECT query, the DBMS sorts the table (it uses merge sort rather than quicksort because merge sort works well with disk I/O), then, depending on the indexes (if the table has any), it just matches the values. If the structure is more complex the DBMS can perform a search in a tree, but this goes too deep; let me check the notes I took.
I recommend turning on the query execution plan; here is an example of how to do so in SQL Server 2008. Then execute your SELECT statement with the WHERE clause and you will be able to begin understanding what is going on inside the DBMS.

Analysis Services Partitioning Issue

I have a Measure Group that is partitioned daily. I can process a particular partition, and the XMLA command completes successfully. Furthermore, I have ensured that at least one partition is processed for every Measure Group, therefore my cube is "partially processed" and I should be able to browse it.
The problem... no data can be seen in the cube for any of the Measures within this Measure Group. What is really driving me crazy is that I can capture the SQL command that SSAS is executing against the server, and it returns rows!
Yet sure enough, when I check the properties of the partition I just processed, it says its size is 0.0 MB. It also has no slice; I don't know if that helps.
If anyone has seen anything like this and has any idea... I am all ears.
You must set a partition slice. That is how SSAS determines that the data should reside in that partition. Without the slicer it is discarding the rows read. Check out http://sqlcat.com/technicalnotes/archive/2007/09/11/ssas-partition-slicing.aspx for example.
Your immediate problem is unlikely to be because of the missing slices. As Mosha explains it here, defining slicing details for partitions is very important for performance reasons. Here is a quote from him:
If the data slice value for a partition is set properly, Analysis Services can quickly eliminate irrelevant partitions from the query processing and significantly reduce the amount of physical I/O and processor time needed for many queries issued against MOLAP and HOLAP partitions
Without the data slice, Analysis Services cannot limit a query to the appropriate partitions and must scan each partition even if zero cells will be returned.
The above says that if partition slices are not defined, then SSAS will not be able to optimize certain queries by scanning only from the relevant partitions.
But it also says that with no slices defined it should still return the correct results, albeit probably much slower. As a side-note, it is also implied that if slices ARE defined BUT improperly, then it could happen that wrong results get returned, or nothing at all.
Since your partitions have no slices defined, the problem must instead lie with the SQL query bindings used for creating the partitions. Have you checked that the data source is configured correctly in SSAS? When you were running the query manually you might have been connected to a different SQL Server instance than the one configured for the SSAS cube (e.g. UAT vs. PROD).