Loading Hybrid Dimension Table with SCD1 and SCD2 attributes + SSIS

I am just starting a new task in which I need to load a hybrid dimension table with both SCD1 and SCD2 attributes. This needs to be achieved as an SSIS package. Can someone guide me on the best way to deal with this in SSIS? Should I use the SCD component, or is there another way? What are the best practices for this?
For the SCD2 type, I am using a MERGE statement.
Thanks

That's a can of worms :)
There are basically four ways to handle SCDs in SSIS:
1. Using the built-in SCD component
2. "Rolling your own" using Lookups, Conditional Splits, Derived Columns, and various destinations.
3. Using T-SQL MERGE
4. Using the third-party Kimball SCD component
I'll alert you to my bias towards #4 - I wrote it. But here's my analysis of the bunch.
1 is a good solution for "small" and "easy" dimensions. Why is it good? It's understandable, handles SCD 1 and 2, and is easy to set up. But why only "small" and "easy" dimensions? Because it uses an internal uncached lookup (RBAR) that can't be improved. Because if you change anything in it (re-run the wizard), it destroys any changes you've made to the data flow. And because it can't handle comparisons where case sensitivity or trailing spaces shouldn't matter.
2 is a good solution for larger dimensions. It's good because it performs pretty well, and is "well documented" in that you can see exactly what it's doing from the names of the components you use and how they're put together. It's also easy to manipulate and change how it operates. The downside is that it takes time to set up and test.
3 is a good solution for huge dimensions. It usually outperforms all other alternatives. But that's about all it has going for it. It's very complex to code, and not very understandable without tons of comments - see the sketch at the end of this answer.
4 is a good solution for just about any size except maybe "huge" dimensions. It's "easy" to use like the stock SCD component, performs as well as or better than 2, and is as configurable as 2.
More info on 4 here.
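To give a flavor of option 3, here is a minimal sketch of the "INSERT over MERGE" pattern for one type 2 attribute. All table and column names (dbo.DimCustomer, stg.Customer, IsCurrent and so on) are made up for illustration; a real hybrid dimension would also set its SCD 1 columns in the WHEN MATCHED branch.

-- Expire changed rows via MERGE, then re-insert them as new current versions.
INSERT INTO dbo.DimCustomer (CustomerID, City, EffectiveDate, IsCurrent)
SELECT CustomerID, City, GETDATE(), 1
FROM (
    MERGE dbo.DimCustomer AS tgt
    USING stg.Customer AS src
        ON tgt.CustomerID = src.CustomerID AND tgt.IsCurrent = 1
    WHEN NOT MATCHED BY TARGET THEN            -- brand new dimension member
        INSERT (CustomerID, City, EffectiveDate, IsCurrent)
        VALUES (src.CustomerID, src.City, GETDATE(), 1)
    WHEN MATCHED AND tgt.City <> src.City THEN -- type 2 change: close out the old row
        UPDATE SET tgt.IsCurrent = 0, tgt.ExpiryDate = GETDATE()
    OUTPUT $action AS MergeAction, src.CustomerID, src.City
) AS changes
WHERE MergeAction = 'UPDATE';                  -- only the expired members need a new row

The outer INSERT picks up the rows the MERGE just expired and creates their replacement versions - exactly the kind of indirection that makes this approach hard to read without comments.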

Related

Looking for a scalable database solution

In the physics lab I work in, at the beginning of each experiment (of which we run several per minute) we perform a series of Inserts into a MariaDB database. One of the tables has a few hundred columns - each corresponding to a named variable - and serves as a log of the parameters used during that run. For example, one variable is the laser power used during a particular step of the experiment.
Over time, experimenters add new variables to parametrize new steps of the experiment. Originally my code handled this by simply adding new columns to the table - but as the number of rows in the table increased beyond around 60000, the time it took to add a column became unusably long (over a minute).
For now, I have circumvented the problem by pre-defining a few hundred extra columns which can be renamed as new variables are needed. At the rate at which variables are added, however, this will only last our lab (or the other labs that use this software) a few years before table maintenance is required. I am wondering whether anyone can recommend a different architecture or different platform that would provide a natural solution to this "number of columns" problem.
I am guessing you are running various different types of experiments, and that is why you need an ever-increasing number of variables? If that is the case, you may want to consider one of the following:
having a separate table for each type of experiment,
having separate tables to hold each experiment type's parameter values (referencing the experiment in the primary table),
having a simpler "experiment parameters" table with 3 (or more, depending on the complexity of values) references: the experiment, the parameter, and the parameter value.
My preference would be one of the first two options; the third one (sketched below) tends to make data more complicated, to analyze AND maintain, than the flexibility is worth.
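To make the third option concrete, here is a rough sketch of such a schema in MariaDB. All names (experiment, parameter, experiment_parameter, and the columns) are hypothetical:

-- One row per experimental run.
CREATE TABLE experiment (
    experiment_id INT AUTO_INCREMENT PRIMARY KEY,
    run_at DATETIME NOT NULL
);

-- One row per named variable, added as experimenters invent them.
CREATE TABLE parameter (
    parameter_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) NOT NULL UNIQUE  -- e.g. 'laser_power_step3'
);

-- One row per (run, variable) pair; new variables need no DDL at all.
CREATE TABLE experiment_parameter (
    experiment_id INT NOT NULL,
    parameter_id INT NOT NULL,
    value DOUBLE NOT NULL,            -- add more value columns if types vary
    PRIMARY KEY (experiment_id, parameter_id),
    FOREIGN KEY (experiment_id) REFERENCES experiment (experiment_id),
    FOREIGN KEY (parameter_id) REFERENCES parameter (parameter_id)
);

Adding a new variable then becomes an INSERT into parameter rather than an ALTER TABLE, which sidesteps the slow-ALTER problem entirely; the trade-off is that reconstructing one run's full parameter list requires a join or pivot.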
It would seem that EAV is best for your scenario. I would always steer away from it, but in this case it seems to make sense. I would keep the last n experiments of data in the main table(s), and offload the older ones to an archive table. Archiving away data you don't need at the moment speeds things up, while the data remains available through joins to the larger tables.
For an introduction to EAV, see the web document by Rick James (a Stack Overflow user). Also, visit the related questions available on the stack.
Every time I look at EAV I wonder why in the world anyone would use it to program against. But just imagining the academic/experimental/ad-hoc environment that you must work in, I can't help but think it might be the best fit for you. See also the high-level exploratory question entitled Should I use EAV model?

Waiting in Pentaho Kettle (Spoon)

I am new to Pentaho Kettle and I want to do multiple operations in a transformation.
Firstly, I am:
1. inserting data from a text file into a main table;
2. loading some of the columns from the main table into a 2nd table, based on some conditions.
The problem is that the 2nd step can only run after the 1st step has completed, because it depends on the data the 1st step loads.
My first step alone takes almost 20 minutes.
Also, in that same transformation, I have to do other data loading from a different table too.
I don't know whether Kettle provides a dedicated option to perform this, such as a switch; I have searched a lot on the web but didn't find anything.
So can anyone help me solve this problem?
That's exactly what the "Blocking Step" does - give it a try.
http://www.nicholasgoodman.com/bt/blog/2008/06/25/ordered-rows-in-kettle/
Or split your transform into multiple transforms and orchestrate them in a Job. If your transforms are simple, I would tend towards using the blocking steps. But using them too much I find makes the transforms messy and complex. Wrapping Transforms in Jobs usually gives you more control.
Brian

Slowly changing dimension type 2 implementation in SSIS

What is the best way to implement a type 2 dimension in SSIS from the following options:
1. using a MERGE statement
2. using the SSIS SCD component
3. using Lookup and Conditional Split (a custom way to implement it)
Could you please give me the details on how to implement each of these, and the pros and cons of each?
Thanks in advance.
Zaim Raza.
4) Dimension Merge SCD
Even if you don't go with that component, watch the six-part video series linked from that page and see which approach works best for your workload and skill set.

DB design question - multiplayer game

I am new to DB design. I am trying to write a board game (4 players max) and was trying to come up with a way for the players to communicate moves to each other.
I am using a DB for this as per suggestions on stackoverflow.
My problem is this - When player A makes a move that move has to be read by B,C and D. So the fact that A made the move needs to be communicated to B,C and D. I am doing it the following way. Please tell me if there is a better way to do it. To me it seems all wrong and incredibly flaky.
I have a table with the following fields -
gameId, userMove, flagA, flagB, flagC, flagD
So when A makes the move I write among other things - (flagA=0, flagB=1, flagC=1, flagD=1)
When B, C or D read A's move, they decrement their corresponding flag.
A will not update the table unless all flags are 0.
Same thing happens when others make their moves.
Comments? There has to be a better way for this. The things I am seeing wrong here -
I am looping on a select until all flags are 0 for A
I am looping on a select until the flag for the corresponding user is set to read the move.
That is a lot of server load and client timeouts I need to worry about.
I hope I have been able to explain my problem clearly. Please ask questions if needed.
Any help is appreciated.
EDIT: The game is web-based (runs in a browser) and I am using PHP for the server-side development, so I cannot use an in-memory cache, though I would have loved to if possible.
Thanks,
- Pav
If the players of your game will be interacting with one game server during a single game session, it looks like you can keep all that state in memory.
Databases are great for durable storage of data, with guarantees for atomicity, consistency and integrity. However, you don't seem to need any of these features for the transient state that you are describing.
If flagA, flagB, flagC and flagD are all bits, you might consider putting them all into one column and treating that column as a bitmask.
This will allow one column to control all flags, and it can make your selects and updates much cleaner - see the sketch below the links.
Read up on using bitmasks here:
http://www.gordano.com/kb.htm?q=966
http://en.wikipedia.org/wiki/Mask_%28computing%29
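For instance, a minimal sketch using MySQL/MariaDB-style bitwise operators (table and values hypothetical), with the four flags packed into a single flags column, bit 0 = A, bit 1 = B, bit 2 = C, bit 3 = D:

-- A makes a move: set B, C and D's bits in one write (2 + 4 + 8 = 14).
UPDATE game SET userMove = 'e2e4', flags = 14 WHERE gameId = 42;

-- B reads the move and acknowledges it by clearing its own bit (value 2).
UPDATE game SET flags = flags & ~2 WHERE gameId = 42;

-- A polls a single column to see whether everyone has caught up.
SELECT 1 FROM game WHERE gameId = 42 AND flags = 0;

This doesn't remove the polling, but it does reduce the four flag columns to one and makes each check a single comparison.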
Have you considered using a file to store the info?

Seeking clarifications about structuring code to reduce cyclomatic complexity

Recently our company has started measuring the cyclomatic complexity (CC) of the functions in our code on a weekly basis, and reporting which functions have improved or worsened. So we have started paying a lot more attention to the CC of functions.
I've read that CC can be informally calculated as 1 + the number of decision points in a function (e.g. an if statement, a for loop, a select/switch, etc.), or alternatively as the number of independent paths through a function...
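(For reference, McCabe's formal definition works on the control-flow graph: M = E - N + 2P, where E is the number of edges, N the number of nodes, and P the number of connected components. For a single function, P = 1.)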
I understand that the easiest way of reducing CC is to use the Extract Method refactoring repeatedly...
There are some things I am unsure about, e.g. what is the CC of the following code fragments?
1)
for (int i = 0; i < 3; i++)
Console.WriteLine("Hello");
And
Console.WriteLine("Hello");
Console.WriteLine("Hello");
Console.WriteLine("Hello");
They both do the same thing, but does the first version have a higher CC because of the for statement?
2)
if (condition1)
if (condition2)
if (condition3)
Console.WriteLine("wibble");
And
if (condition1 && condition2 && condition3)
Console.WriteLine("wibble");
Assuming the language does short-circuit evaluation, as C# does, these two code fragments have the same effect... but is the CC of the first fragment higher because it has 3 decision points/if statements?
3)
if (condition1)
{
Console.WriteLine("one");
if (condition2)
Console.WriteLine("one and two");
}
And
if (condition3)
Console.WriteLine("fizz");
if (condition4)
Console.WriteLine("buzz");
These two code fragments do different things, but do they have the same CC? Or does the nested if statement in the first fragment have a higher CC? i.e. nested if statements are mentally more complex to understand, but is that reflected in the CC?
1) Yes. Your first example has a decision point and your second does not, so the first has a higher CC.
2) Yes-maybe: your first example has multiple decision points and thus a higher CC. (See below for explanation.)
3) Yes-maybe. Obviously they have the same number of decision points, but there are different ways to calculate CC, which means ...
... if your company is measuring CC in a specific way, then you need to become familiar with that method (hopefully they are using tools to do this). There are different ways to calculate CC for different situations (case statements, Boolean operators, etc.), but you should get the same kind of information from the metric no matter what convention you use.
The bigger problem is what others have mentioned, that your company seems to be focusing more on CC than on the code behind it. In general, sure, below 5 is great, below 10 is good, below 20 is okay, 21 to 50 should be a warning sign, and above 50 should be a big warning sign, but those are guides, not absolute rules. You should probably examine the code in a procedure that has a CC above 50 to ensure it isn't just a huge heap of code, but maybe there is a specific reason why the procedure is written that way, and it's not feasible (for any number of reasons) to refactor it.
If you use tools to refactor your code to reduce CC, make sure you understand what the tools are doing, and that they're not simply shifting one problem to another place. Ultimately, you want your code to have few defects, to work properly, and to be relatively easy to maintain. If that code also has a low CC, good for it. If your code meets these criteria and has a CC above 10, maybe it's time to sit down with whatever management you can and defend your code (and perhaps get them to examine their policy).
After browsing through the Wikipedia entry and Thomas J. McCabe's original paper, it seems that the items you mentioned above are known problems with the metric.
However, most metrics do have pros and cons. I suppose in a large enough program the CC value could point to possibly complex parts of your code. But a higher CC does not necessarily mean the code is actually complex.
Like all software metrics, CC is not perfect. Used on a big enough code base, it can give you an idea of where the problematic zones might be.
There are two things to keep in mind here:
Big enough code base: In any non-trivial project you will have functions with a really high CC value - so high that it does not matter whether, in one of your examples, the CC would be 2 or 3. A function with a CC of, let's say, over 300 is definitely something to analyse; it doesn't matter if the CC is 301 or 302.
Don't forget to use your head. There are methods that need many decision points. Often they can be refactored somehow to have fewer, but sometimes they can't. Do not go with a rule like "Refactor all methods with a CC > xy". Have a look at them and use your brain to decide what to do.
I like the idea of a weekly analysis. In quality control, trend analysis is a very effective tool for identifying problems during their creation. This is so much better than having to wait until they get so big that they become obvious (see SPC for some details).
CC is not a panacea for measuring quality. Clearly a repeated statement is not "better" than a loop, even if a loop has a bigger CC. The reason the loop has a bigger CC is that sometimes it might get executed and sometimes it might not, which leads to two different "cases" which should both be tested. In your case the loop will always be executed three times because you use a constant, but CC is not clever enough to detect this.
Same with the chained ifs in example 2 - this structure allows you to add a statement which would be executed when only condition1 and condition2 are true (but condition3 is not). This is a special case which is not possible in the version using &&. So the if-chain has a bigger potential for special cases, even if you don't utilize this in your code.
This is the danger of applying any metric blindly. The CC metric certainly has a lot of merit, but as with any other technique for improving code it can't be evaluated divorced from context. Point your management at Capers Jones's discussion of the Lines of Code measurement (wish I could find a link for you). He points out that if Lines of Code were a good measure of productivity, then assembly language developers would be the most productive developers on earth. Of course they're no more productive than other developers; it just takes them a lot more code to accomplish what higher-level languages do with less source code. I mention this, as I say, so you can show your managers how dumb it is to blindly apply metrics without intelligent review of what the metric is telling you.
I would suggest that, if they're not doing so already, your management would be wise to use the CC measure as a way of spotting potential hot spots in the code that should be reviewed further. Blindly aiming for the goal of lower CC, without any reference to code maintainability or other measures of good coding, is just foolish.
Cyclomatic complexity is analogous to temperature. Both are measurements, and in most cases meaningless without context. If I said the temperature outside was 72 degrees, that doesn't mean much; but if I added that I was at the North Pole, the number 72 becomes significant. Likewise, if someone told me a method has a cyclomatic complexity of 10, I can't determine whether that is good or bad without its context.
When I code review an existing application, I find cyclomatic complexity a useful “starting point” metric. The first thing I check for are methods with a CC > 10. These “>10” methods are not necessarily bad. They just provide me a starting point for reviewing the code.
General rules when considering a CC number:
The relationship between the CC number and the number of tests should be CC <= number of tests.
Refactor for CC only if it improves maintainability.
A CC above 10 often indicates one or more code smells.
[Off topic] If you favor readability over a good score in the metrics (was it J. Spolsky who said, "what's measured gets done"? - meaning that metrics are abused more often than not, I suppose), it is often better to use a well-named boolean to replace your complex conditional statement.
So
if (condition1 && condition2 && condition3)
Console.WriteLine("wibble");
becomes
bool theWeatherIsFine = condition1 && condition2 && condition3;
if (theWeatherIsFine)
Console.WriteLine("wibble");
I'm no expert at this subject, but I thought I would give my two cents. And maybe that's all this is worth.
Cyclomatic Complexity seems to be just a particular automated shortcut to finding potentially (but not definitely) problematic code snippets. But isn't the real problem to be solved one of testing? How many test cases does the code require? If CC is higher, but number of test cases is the same and code is cleaner, don't worry about CC.
1.) There is no decision point there: only one path through the program, only one possible result, with either of the two versions. The first is more concise and better, cyclomatic complexity be damned.
1 test case for both
2.) In both cases, you either write "wibble" or you don't.
2 test cases for both
3.) First one could result in nothing, "one", or "one" and "one and two". 3 paths. 2nd one could result in nothing, either of the two, or both of them. 4 paths.
3 test cases for the first
4 test cases for the second