Can a significance test be one-sided? - regression

I am conducting a regression analysis and my aim is to see whether a theoretical approach applies to the data. The dependent variable is separated into three categories, so I use one of them as the reference and add two dummy variables.
Before looking at the data, I expect one of my dummy variables to have a negative coefficient and the other a positive coefficient. These expectations come from the theory I am trying to test.
My question is whether I can treat the significance tests of these dummy variables as one-sided.
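For concreteness, here is a minimal sketch of what a directional test on a dummy coefficient could look like, assuming an OLS fit with statsmodels; the data and the names y, d1, d2 are invented purely for illustration. When theory predicts the sign in advance, the one-sided p-value is the tail probability in the predicted direction, which equals half the two-sided p-value whenever the estimate has the expected sign.
# Sketch only: hypothetical data, illustrating a one-sided test on a dummy coefficient.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 200
d1 = rng.integers(0, 2, n)                 # dummy for category 1 (category 0 is the reference)
d2 = (1 - d1) * rng.integers(0, 2, n)      # dummy for category 2, mutually exclusive with d1
y = 1.0 - 0.5 * d1 + 0.8 * d2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([d1, d2]))
fit = sm.OLS(y, X).fit()

t_d1 = fit.tvalues[1]                      # t statistic for the d1 coefficient
df = fit.df_resid

# Theory predicts beta_d1 < 0, so the one-sided p-value is P(T <= t) under H0.
p_one_sided = stats.t.cdf(t_d1, df)
print(fit.pvalues[1], p_one_sided)         # two-sided vs. one-sided p-value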

So, the question essentially is whether the assumption that
dummy1 * dummy2 < 0
is premature or not. Cases:
We know that it is the case
In this scenario it is correct to assume that, but you also need to make sure that your software properly validates new input and fixes old data that would violate this rule.
We do not know whether that's the case, but we know that it should be the case
In this scenario your assumption is premature if you do not have a test and data validation for dummy1 having a different sign from dummy2. If you have validation/test/fixes and you do not have counter-examples, then it seems not to be premature.
We only have the theory
In this case this is premature until it is proven either mathematically or via engineering means.

Related

Simulating a matrix of variables with predefined correlation structure

For a simulation study I am working on, we are trying to test an algorithm that aims to identify specific culprit factors that predict a binary outcome of interest from a large mixture of possible exposures that are mostly unrelated to the outcome. To test this algorithm, I am trying to simulate the following data:
A binary dependent variable
A set of, say, 1000 variables, mostly binary and some continuous, that are not associated with the outcome (that is, completely independent of the binary dependent variable, but possibly correlated with one another).
A group of 10 or so binary variables which will be associated with the dependent variable. I will determine a priori the magnitude of their correlation with the binary dependent variable, as well as their frequency in the data.
Generating a random set of binary variables is easy. But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?
Thank you!
"But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?"
With statistical sampling you can't ensure anything, you can only adjust the acceptable risk. Finding an acceptable level of risk may be harder than many people think.
Spurious correlations are a very real phenomenon. Real independent observations will often contain correlations, and if you want to actually test your algorithm to see how it will perform in reality then your tests should produce such phenomena in a manner similar to the real world—you should be generating independent candidate factors and allowing spurious correlations to occur.
If you are performing ~1000 independent tests of candidate factors at a risk level of α = 0.05, you can expect roughly 50 of the truly unrelated factors to appear significant purely by chance. To avoid this, you need to adjust your testing threshold using something along the lines of a Bonferroni correction: with 1000 simultaneous tests, each individual test must be run at a threshold 1000 times smaller. Recall that statistical discriminating power is driven by the standard error, which shrinks only with the square root of the sample size, so holding your power at that much stricter threshold requires a substantially larger sample than a single test for significance would.
So in summary I'd say that you shouldn't attempt to ensure lack of correlation; it's going to occur in the real world. You can mitigate the risk of non-predictive factors being included due to spurious correlation by generating massive amounts of data. In practice there will be non-predictors that leak through unless you can obtain enough data, so I'd suggest that your testing should address the rates of occurrence as a function of the number of candidate factors and the sample size.
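To make the "spurious correlations will happen anyway" point concrete, here is a small sketch of my own (not part of the original answer): it generates an outcome and 1000 binary candidate factors that are independent of it, then counts how many test as significant at α = 0.05. The names and sizes are arbitrary.
# Sketch: truly independent factors still appear "significant" at roughly the rate alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs, n_factors, alpha = 2000, 1000, 0.05

y = rng.integers(0, 2, n_obs)                    # binary outcome
X = rng.integers(0, 2, (n_obs, n_factors))       # independent binary candidate factors

p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_factors)])

print("spuriously significant at 0.05:", np.sum(p_values < alpha))            # around 50 expected
print("significant after Bonferroni:", np.sum(p_values < alpha / n_factors))  # usually 0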

Why would a language NOT use Short-circuit evaluation?

Why would a language NOT use Short-circuit evaluation? Are there any benefits of not using it?
I see that it could lead to some performance issues... is that true? Why?
Related question : Benefits of using short-circuit evaluation
Reasons NOT to use short-circuit evaluation:
Because it will behave differently and produce different results if your functions, property Gets or operator methods have side-effects. And this may conflict with: A) Language Standards, B) previous versions of your language, or C) the default assumptions of your language's typical users. These are the reasons that VB has for not short-circuiting.
Because you may want the compiler to have the freedom to reorder and prune expressions, operators and sub-expressions as it sees fit, rather than in the order that the user typed them in. These are the reasons that SQL has for not short-circuiting (or at least not in the way that most developers coming to SQL think it would). Thus SQL (and some other languages) may short-circuit, but only if it decides to and not necessarily in the order that you implicitly specified.
I am assuming here that you are asking about "automatic, implicit order-specific short-circuiting", which is what most developers expect from C, C++, C#, Java, etc. Both VB and SQL have ways to explicitly force order-specific short-circuiting. However, usually when people ask this question it's a "Do What I Meant" question; that is, they mean "why doesn't it Do What I Want?", as in, automatically short-circuit in the order that I wrote it.
One benefit I can think of is that some operations might have side-effects that you might expect to happen.
Example:
if (true || someBooleanFunctionWithSideEffect()) {
...
}
But that's typically frowned upon.
Ada does not do it by default. In order to force short-circuit evaluation, you have to use "and then" or "or else" instead of "and" or "or".
The issue is that there are some circumstances where it actually slows things down. If the second condition is quick to calculate and the first condition is almost always true for "and" or false for "or", then the extra check-branch instruction is kind of a waste. However, I understand that with modern processors with branch predictors, this isn't so much the case. Another issue is that the compiler may happen to know that the second half is cheaper or likely to fail, and may want to reorder the check accordingly (which it couldn't do if short-circuit behavior is defined).
I've heard objections that it can lead to unexpected behavior of the code in the case where the second test has side effects. IMHO it is only "unexpected" if you don't know your language very well, but some will argue this.
In case you are interested in what actual language designers have to say about this issue, here's an excerpt from the Ada 83 (original language) Rationale:
The operands of a boolean expression such as A and B can be evaluated in any order. Depending on the complexity of the term B, it may be more efficient (on some but not all machines) to evaluate B only when the term A has the value TRUE. This however is an optimization decision taken by the compiler and it would be incorrect to assume that this optimization is always done. In other situations we may want to express a conjunction of conditions where each condition should be evaluated (has meaning) only if the previous condition is satisfied. Both of these things may be done with short-circuit control forms ...
In Algol 60 one can achieve the effect of short-circuit evaluation only by use of conditional expressions, since complete evaluation is performed otherwise. This often leads to constructs that are tedious to follow...
Several languages do not define how boolean conditions are to be evaluated. As a consequence programs based on short-circuit evaluation will not be portable. This clearly illustrates the need to separate boolean operators from short-circuit control forms.
Look at my example at On SQL Server boolean operator short-circuit, which shows why a certain access path in SQL is more efficient if boolean short-circuit is not used. My blog example shows how relying on boolean short-circuit can break your code if you assume short-circuiting in SQL, but if you read the reasoning for why SQL evaluates the right-hand side first, you'll see that it is correct and results in a much improved access path.
Bill has alluded to a valid reason not to use short-circuiting, but to spell it out in more detail: highly parallel architectures sometimes have problems with branching control paths.
Take NVIDIA’s CUDA architecture for example. The graphics chips use an SIMT architecture which means that the same code is executed on many parallel threads. However, this only works if all threads take the same conditional branch every time. If different threads take different code paths, evaluation is serialized – which means that the advantage of parallelization is lost, because some of the threads have to wait while others execute the alternative code branch.
Short-circuiting actually involves branching the code so short-circuit operations may be harmful on SIMT architectures like CUDA.
– But like Bill said, that’s a hardware consideration. As far as languages go, I’d answer your question with a resounding no: preventing short-circuiting does not make sense.
I'd say 99 times out of 100 I would prefer the short-circuiting operators for performance.
But there are two big reasons I've found where I won't use them.
(By the way, my examples are in C where && and || are short-circuiting and & and | are not.)
1.) When you want to call two or more functions in an if statement regardless of the value returned by the first.
if (isABC() || isXYZ()) // short-circuiting logical operator
//do stuff;
In that case isXYZ() is only called if isABC() returns false. But you may want isXYZ() to be called no matter what.
So instead you do this:
if (isABC() | isXYZ()) // non-short-circuiting bitwise operator
//do stuff;
2.) When you're performing boolean math with integers.
myNumber = i && 8; // short-circuiting logical operator
is not necessarily the same as:
myNumber = i & 8; // non-short-circuiting bitwise operator
In this situation you can actually get different results because the short-circuiting operator won't necessarily evaluate the entire expression. And that makes it basically useless for boolean math. So in this case I'd use the non-short-circuiting (bitwise) operators instead.
Like I was hinting at, these two scenarios really are rare for me. But you can see there are real programming reasons for both types of operators. And luckily most of the popular languages today have both. Even VB.NET has the AndAlso and OrElse short-circuiting operators. If a language today doesn't have both I'd say it's behind the times and really limits the programmer.
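For what it's worth, the same two scenarios can be shown in Python, where and/or short-circuit while the bitwise &/| always evaluate both operands. This is my own illustrative sketch, not part of the answer above.
# Sketch: short-circuiting (or) vs non-short-circuiting (|) in Python.
calls = []

def is_abc():
    calls.append("is_abc")
    return True

def is_xyz():
    calls.append("is_xyz")
    return True

if is_abc() or is_xyz():      # short-circuits: is_xyz() is never called
    pass
print(calls)                  # ['is_abc']

calls.clear()
if is_abc() | is_xyz():       # both calls happen before | is applied
    pass
print(calls)                  # ['is_abc', 'is_xyz']

# "Boolean math with integers": logical and bitwise operators give different values.
i = 3
print(i and 8)                # 8 (logical: returns the last evaluated operand)
print(i & 8)                  # 0 (bitwise: 0b0011 & 0b1000)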
If you wanted the right hand side to be evaluated:
if( x < 13 | ++y > 10 )
printf("do something\n");
Perhaps you wanted y to be incremented whether or not x < 13. A good argument against doing this, however, is that creating conditions without side effects is usually better programming practice.
As a stretch:
If you wanted a language to be super secure (at the cost of awesomeness), you would remove short-circuit evaluation. When something 'secure' takes a variable amount of time to happen, a timing attack could be used to exploit it. Short-circuit evaluation makes things take different amounts of time to execute, hence opening the hole for the attack. In this case, disallowing short-circuit evaluation would hopefully help people write more secure algorithms (with respect to timing attacks, anyway).
The Ada programming language supports both boolean operators that do not short-circuit (AND, OR), to allow a compiler to optimize and possibly parallelize the constructs, and operators with an explicit request for short-circuiting (AND THEN, OR ELSE) when that's what the programmer desires. The downside to such a dual-pronged approach is that it makes the language a bit more complex (1000 design decisions taken in the same "let's do both!" vein will make a programming language a LOT more complex overall ;-).
Not that I think this is what's going on in any language now, but it would be rather interesting to feed both sides of an operation to different threads. Most operands could be pre-determined not to interfere with each other, so they would be good candidates for handing off to different CPUs.
This kind of thing matters on highly parallel CPUs that tend to evaluate multiple branches and choose one.
Hey, it's a bit of a stretch but you asked "Why would a language"... not "Why does a language".
The language Lustre does not use short-circuit evaluation. In if-then-elses, both then and else branches are evaluated at each tick, and one is considered the result of the conditional depending on the evaluation of the condition.
The reason is that this language, and other synchronous dataflow languages, have a concise syntax to speak of the past. Each branch needs to be computed so that the past of each is available if it becomes necessary in future cycles. The language is supposed to be functional, so that wouldn't matter, but you may call C functions from it (and perhaps notice they are called more often than you thought).
In Lustre, writing the equivalent of
if (y <> 0) then 100/y else 100
is a typical beginner mistake. The division by zero is not avoided, because the expression 100/y is evaluated even on cycles when y=0.
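The same pitfall can be mimicked in any strict language by wrapping the conditional in an ordinary function, since function arguments are evaluated before the call. A small Python sketch of my own (not Lustre code) to illustrate:
# Sketch: a conditional written as a plain function evaluates both branches eagerly,
# much like the Lustre if-then-else described above.
def ite(cond, then_val, else_val):
    return then_val if cond else else_val

y = 0
try:
    result = ite(y != 0, 100 / y, 100)   # 100 / y is evaluated before ite() even runs
except ZeroDivisionError:
    print("division by zero happened anyway")

result = 100 / y if y != 0 else 100      # Python's own conditional expression short-circuits
print(result)                            # 100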
Because short-circuiting can change the behavior of an application IE:
if(!SomeMethodThatChangesState() || !SomeOtherMethodThatChangesState())
I'd say it's valid for readability issues; if someone takes advantage of short circuit evaluation in a not fully obvious way, it can be hard for a maintainer to look at the same code and understand the logic.
If memory serves, Erlang provides two constructs: the standard and/or, and then andalso/orelse. This clarifies intent, as in "yes, I know this is short-circuiting, and you should too", whereas at other points the intent needs to be derived from the code.
As an example, say a maintainer comes across these lines:
if(user.inDatabase() || user.insertInDatabase())
user.DoCoolStuff();
It takes a few seconds to recognize that the intent is "if the user isn't in the Database, insert him/her/it; if that works do cool stuff".
As others have pointed out, this is really only relevant when doing things with side effects.
I don't know about any performance issues, but one possible argumentation to avoid it (or at least excessive use of it) is that it may confuse other developers.
There are already great responses about the side-effect issue, but I didn't see anything about the performance aspect of the question.
If you do not allow short-circuit evaluation, the performance issue is that both sides must be evaluated even though it will not change the outcome. This is usually a non-issue, but may become relevant under one of these two circumstances:
The code is in an inner loop that is called very frequently
There is a high cost associated with evaluating the expressions (perhaps IO or an expensive computation)
The short-circuit evaluation automatically provides conditional evaluation of a part of the expression.
The main advantage is that it simplifies the expression.
The performance could be improved but you could also observe a penalty for very simple expressions.
Another consequence is that side effects of the evaluation of the expression could be affected.
In general, relying on side-effect is not a good practice, but in some specific context, it could be the preferred solution.
VB6 doesn't use short-circuit evaluation; I don't know if newer versions do, but I doubt it. I believe this is just because older versions didn't either, and because most of the people who used VB6 wouldn't expect that to happen, and it would lead to confusion.
This is just one of the things that made it extremely hard for me to get out of being a noob VB programmer who wrote spaghetti code, and get on with my journey to be a real programmer.
Many answers have talked about side-effects. Here's a Python example without side-effects in which (in my opinion) short-circuiting improves readability.
for i in range(len(myarray)):
if myarray[i]>5 or (i>0 and myarray[i-1]>5):
print "At index",i,"either arr[i] or arr[i-1] is big"
The short-circuit ensures we don't try to access myarray[-1] when i is 0; in Python that would not raise an exception but would silently read the last element of the array, giving a wrong answer. The code could of course be written without short-circuits, e.g.
for i in range(len(myarray)):
if myarray[i]<=5: continue
if i==0: continue
if myarray[i-1]<=5: continue
print "At index",i,...
but I think the short-circuit version is more readable.

How to correct the user input (Kind of google "did you mean?")

I have the following requirement: -
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to make a kind of Google "Did you mean". This will list all the possible values from my datastore. There is a similar but not identical question here. This did not answer my question.
My question: -
1) I think it is not advisable to store this data in an RDBMS, because then I can't filter in the SQL queries and would have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose, there is a name Franky in the dataset.
If a user types Phranky, how do I match Franky? Do I have to loop through all the names?
I came across Levenshtein Distance, which will be a good technique to find the possible strings. But again, my question is do I have to operate on all 1 million values from my data store?
3) I know Google does it by watching users' behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say, distance algorithms. Because the former method will require a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios: -
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store them in the database, then index that column to avoid having to scan the table.
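To illustrate the precompute-and-index idea, here is a simplified Soundex sketch of my own; it skips some rules of the official algorithm (for example the h/w separator handling), so treat it as an approximation rather than a reference implementation.
# Simplified Soundex-style key: first letter plus up to three consonant-class digits.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def simple_soundex(name):
    name = name.lower()
    coded = [CODES.get(ch, "0") for ch in name]   # "0" for vowels, h, w, y
    out = [name[0].upper()]
    prev = coded[0]
    for digit in coded[1:]:
        if digit != "0" and digit != prev:
            out.append(digit)
        prev = digit
    return ("".join(out) + "000")[:4]

# Precompute a key per name and index on it instead of scanning the whole table.
index = {}
for n in ["Franky", "Frankie", "Francky", "Robert", "Rupert"]:
    index.setdefault(simple_soundex(n), []).append(n)
print(index)   # {'F652': ['Franky', 'Frankie', 'Francky'], 'R163': ['Robert', 'Rupert']}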
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use it to calculate probable matches. (It's based on the Levenshtein distance.)
(Update: after having read Ben S's answer, I think using an existing solution, possibly aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
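Here is a toy sketch of that heuristic (my own illustration; the thresholds and the shape of the log records are invented), using difflib's similarity ratio as a cheap stand-in for an edit-distance check:
# Sketch of the "user corrected themselves" heuristic described above.
from difflib import SequenceMatcher

def is_refinement(q1, q2, seconds_apart, q1_had_clicks, q2_had_clicks,
                  max_gap=60, min_similarity=0.8):
    similar = SequenceMatcher(None, q1.lower(), q2.lower()).ratio() >= min_similarity
    return (seconds_apart <= max_gap      # two searches in a short window
            and not q1_had_clicks         # the first query went nowhere
            and q2_had_clicks             # the second query was useful
            and similar)                  # and the two queries are close to each other

print(is_refinement("someting", "something", 12, False, True))    # True
print(is_refinement("cats", "quantum physics", 12, False, True))  # False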
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names, and selecting an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, substituting a q for an e is more likely than substituting an l for an e.
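A condensed sketch of that approach, adapted from the description above rather than copied from Norvig's code; corpus_counts stands in for a word-frequency table you would build from your own text.
# Sketch: generate one-edit candidates, keep those seen in the corpus, pick the most frequent.
import string

corpus_counts = {"something": 120, "sometime": 40, "someone": 95}   # toy frequency table

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in corpus_counts:
        return word
    candidates = edits1(word) & set(corpus_counts)
    if not candidates:   # widen to edit distance 2 if nothing matched
        candidates = {e2 for e1 in edits1(word) for e2 in edits1(e1)} & set(corpus_counts)
    return max(candidates, key=corpus_counts.get) if candidates else word

print(correct("someting"))   # "something"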
For people who are recommending Soundex, it is very out of date. Metaphone (simpler) or Double Metaphone (complex) are much better. If it really is name data, it should work fine, if the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure: pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all, the words have to have a "phoneme" overlap, or maybe even two. This pre-indexing step (which has a low false positive rate) can take the complexity down a lot (in the practical case, to something like O(30^2 * k^2), where k << n).
You have two possible issues that you need to address (or not address if you so choose)
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith; however, if I typed smith it would not make sense to suggest sith. Determine an algorithm which measures the relative likelihood of a word and only suggest words that are more likely.
My experience in loose matching reinforced a simple but important lesson - perform as many indexing/sieve layers as you need and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(nlogn), and an O(n^2) algorithm - perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
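A sketch of that layered-sieve idea (my own illustration): the cheap O(n) filters run first, and the expensive edit-distance pass only sees whatever survives them. The filters and thresholds are arbitrary.
# Sketch: cheap sieves before the expensive Levenshtein computation.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def candidates(query, names, max_dist=2):
    qset = set(query.lower())
    pool = [n for n in names if abs(len(n) - len(query)) <= max_dist]               # sieve 1: length
    pool = [n for n in pool if len(qset & set(n.lower())) >= len(qset) - max_dist]  # sieve 2: overlap
    return [n for n in pool if levenshtein(query.lower(), n.lower()) <= max_dist]   # expensive pass

print(candidates("Phranky", ["Franky", "Frank", "Francine", "Bob"]))   # ['Franky']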

Contains Test for a Constant Set

The problem statement:
Given a set of integers that is known in advance, generate code to test if a single integer is in the set. The domain of the testing function is the integers in some consecutive range.
Nothing in particular is known now about the range or the set to be tested. The range could be small or huge (a solution can reject problems that are too big, but higher limits are better). It could be that very few of the values in the allowed range are in the set, or most of them are, or anything in between. The set may be uniformly distributed or clustered. There may be large sections of only contained/not-contained values, or there may be at least a few of each type of value in most swaths (sort of like the assumptions made about items to be sorted when analyzing sorting algorithms).
The objective is a procedure for generating effective code for running the test.
Partial solutions that come to mind include
perfect hash function (costly for large sets)
range tests: foreach(b in ranges) if(b.l <= v && v <= b.h) return true;
trees/indexes (more costly than others in some cases)
table lookup (costly for large sets)
the inverse of any of these (kudos to Jason S)
It seems that an ideal solution would be able to pick whichever option is best, or, if none works well, use a tree to break down the full range into sections and then switch to other options for subsections that are better suited to them.
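To make that concrete, here is a rough sketch of the "pick a representation based on the set's shape" idea, in Python rather than generated code, purely to show the dispatch logic; the density threshold and size limit are arbitrary, and a real generator would emit the chosen strategy as source instead of returning a closure.
# Sketch: choose a membership-test representation from the set's density and span.
from bisect import bisect_right

def build_tester(values, lo, hi):
    values = sorted(set(values))
    span = hi - lo + 1
    density = len(values) / span

    if density > 0.05 and span <= 10_000_000:
        bits = bytearray(span)                 # dense enough: flat bit table over the range
        for v in values:
            bits[v - lo] = 1
        return lambda x: lo <= x <= hi and bits[x - lo] == 1

    ranges = []                                # sparse: collapse runs into [start, end] pairs
    for v in values:
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1][1] = v
        else:
            ranges.append([v, v])
    starts = [r[0] for r in ranges]

    def in_ranges(x):                          # binary search over the range starts
        i = bisect_right(starts, x) - 1
        return i >= 0 and ranges[i][0] <= x <= ranges[i][1]
    return in_ranges

test = build_tester([1, 2, 3, 10, 11, 500], lo=0, hi=1000)
print(test(2), test(4), test(11), test(499))   # True False True False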
Topic(s) that might be useful include:
Huffman coding
Note: this is not homework. If it were issued as homework below the doctoral level, the prof should be shot with a Nerf gun (if you don't get that, then re-read the problem; it is very much non-trivial).
Note: This is a problem that occurred to me a few days ago and I've been puzzling over it off and on. I have no direct use for this but thought it would be a cool problem to attack. The reason that I want to generate the code is because generated code will be no slower than general code (it can be the same thing if needed) and might be faster in some/many cases.
I'm posting this question as much to clarify my thoughts as anything. If I can come up with any reasonable or cool solutions I plan on implementing them as a template meta program (the other reason for generated code)
Some people have noted that the problem is very general. That is the point. I'm hoping to generate a system that would work on a very general domain: sets of integers in some range.
A previous question on dictionary/spellchecking had a number of responses that mentioned Bloom filters; maybe that would help.
I would think that testing for large sets is going to be expensive no matter what.
let's pretend, for a moment, that this is a real question:
there are no limits on the size of the base set or the input set
this makes the "problem" unrealistic, underspecified, and un-solvable in any practical sense
if someone wants to posit a solution, here's some unit test cases:
unit test 1:
the base set is all integers between -1,000,000,000,000 and +1,000,000,000,000 except for 100,000,000,000 randomly-removed values
the input set is 100,000,000,000 randomly-generated integers in the same range
unit test 2:
the base set is the Fibonacci series
the input set is 1T randomly-generated integers in the range 0..infinity
There's also boost::dynamic_bitset; I'm not sure how it scales in time, or in space with respect to the distribution of the original numbers (e.g. if the bits are stored in chunks of 8/16/32/64, then sparse bitsets are inefficient).
or perhaps this (compressed bit set) or this (bit vector) webpage (I googled for "large sparse bit sets" and "compressed bit sets")

Seeking clarifications about structuring code to reduce cyclomatic complexity

Recently our company has started measuring the cyclomatic complexity (CC) of the functions in our code on a weekly basis, and reporting which functions have improved or worsened. So we have started paying a lot more attention to the CC of functions.
I've read that CC could be informally calculated as 1 + the number of decision points in a function (e.g. if statement, for loop, select etc), or also the number of paths through a function...
I understand that the easiest way of reducing CC is to use the Extract Method refactoring repeatedly...
There are some things I am unsure about, e.g. what is the CC of the following code fragments?
1)
for (int i = 0; i < 3; i++)
Console.WriteLine("Hello");
And
Console.WriteLine("Hello");
Console.WriteLine("Hello");
Console.WriteLine("Hello");
They both do the same thing, but does the first version have a higher CC because of the for statement?
2)
if (condition1)
if (condition2)
if (condition 3)
Console.WriteLine("wibble");
And
if (condition1 && condition2 && condition3)
Console.WriteLine("wibble");
Assuming the language does short-circuit evaluation, such as C#, then these two code fragments have the same effect... but is the CC of the first fragment higher because it has 3 decision points/if statements?
3)
if (condition1)
{
Console.WriteLine("one");
if (condition2)
Console.WriteLine("one and two");
}
And
if (condition3)
Console.WriteLine("fizz");
if (condition4)
Console.WriteLine("buzz");
These two code fragments do different things, but do they have the same CC? Or does the nested if statement in the first fragment have a higher CC? i.e. nested if statements are mentally more complex to understand, but is that reflected in the CC?
Yes. Your first example has a decision point and your second does not, so the first has a higher CC.
Yes-maybe, your first example has multiple decision points and thus a higher CC. (See below for explanation.)
Yes-maybe. Obviously they have the same number of decision points, but there are different ways to calculate CC, which means ...
... if your company is measuring CC in a specific way, then you need to become familiar with that method (hopefully they are using tools to do this). There are different ways to calculate CC for different situations (case statements, Boolean operators, etc.), but you should get the same kind of information from the metric no matter what convention you use.
The bigger problem is what others have mentioned, that your company seems to be focusing more on CC than on the code behind it. In general, sure, below 5 is great, below 10 is good, below 20 is okay, 21 to 50 should be a warning sign, and above 50 should be a big warning sign, but those are guides, not absolute rules. You should probably examine the code in a procedure that has a CC above 50 to ensure it isn't just a huge heap of code, but maybe there is a specific reason why the procedure is written that way, and it's not feasible (for any number of reasons) to refactor it.
If you use tools to refactor your code to reduce CC, make sure you understand what the tools are doing, and that they're not simply shifting one problem to another place. Ultimately, you want your code to have few defects, to work properly, and to be relatively easy to maintain. If that code also has a low CC, good for it. If your code meets these criteria and has a CC above 10, maybe it's time to sit down with whatever management you can and defend your code (and perhaps get them to examine their policy).
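To see how much the counting convention matters, here is a rough sketch of the informal "1 + decision points" count using Python's ast module (Python rather than C#, purely for brevity). Whether each extra operand of a boolean operator adds to the count is exactly the kind of choice that differs between tools, so the flag below is an assumption, not a standard.
# Sketch of one CC convention: 1 + decision points, with boolean operands optionally counted.
import ast, textwrap

def cyclomatic_complexity(source, count_bool_ops=True):
    tree = ast.parse(textwrap.dedent(source))
    cc = 1
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)):
            cc += 1
        elif count_bool_ops and isinstance(node, ast.BoolOp):
            cc += len(node.values) - 1       # each additional and/or operand adds a branch
    return cc

nested = """
if condition1:
    if condition2:
        if condition3:
            print("wibble")
"""
combined = """
if condition1 and condition2 and condition3:
    print("wibble")
"""
print(cyclomatic_complexity(nested))                          # 4
print(cyclomatic_complexity(combined))                        # 4
print(cyclomatic_complexity(combined, count_bool_ops=False))  # 2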
After browsing through the Wikipedia entry and Thomas J. McCabe's original paper, it seems that the items you mentioned above are known problems with the metric.
However, most metrics do have pros and cons. I suppose in a large enough program the CC value could point to possibly complex parts of your code. But a higher CC does not necessarily mean the code is complex.
Like all software metrics, CC is not perfect. Used on a big enough code base, it can give you an idea of where a problematic zone might be.
There are two things to keep in mind here:
Big enough code base: In any non trivial project you will have functions that have a really high CC value. So high that it does not matter if in one of your examples, the CC would be 2 or 3. A function with a CC of let's say over 300 is definitely something to analyse. Doesn't matter if the CC is 301 or 302.
Don't forget to use your head. There are methods that need many decision points. Often they can be refactored somehow to have fewer, but sometimes they can't. Do not go with a rule like "Refactor all methods with a CC > xy". Have a look at them and use your brain to decide what to do.
I like the idea of a weekly analysis. In quality control, trend analysis is a very effective tool for identifying problems during their creation. This is so much better than having to wait until they get so big that they become obvious (see SPC for some details).
CC is not a panacea for measuring quality. Clearly a repeated statement is not "better" than a loop, even if a loop has a bigger CC. The reason the loop has a bigger CC is that sometimes it might get executed and sometimes it might not, which leads to two different "cases" which should both be tested. In your case the loop will always be executed three times because you use a constant, but CC is not clever enough to detect this.
Same with the chained ifs in example 2 - this structure allows you to add a statement which would be executed if only condition1 and condition2 are true (before condition3 is checked). This is a special case which is not possible when using &&. So the if-chain has a bigger potential for special cases even if you don't utilize this in your code.
This is the danger of applying any metric blindly. The CC metric certainly has a lot of merit, but as with any other technique for improving code it can't be evaluated divorced from context. Point your management at Capers Jones's discussion of the Lines of Code measurement (wish I could find a link for you). He points out that if Lines of Code were a good measure of productivity then assembler language developers would be the most productive developers on earth. Of course they're no more productive than other developers; it just takes them a lot more code to accomplish what higher level languages do with less source code. I mention this, as I say, so you can show your managers how dumb it is to blindly apply metrics without intelligent review of what the metric is telling you.
I would suggest that, if they're not already, your management would be wise to use the CC measure as a way of spotting potential hot spots in the code that should be reviewed further. Blindly aiming for the goal of lower CC without any reference to code maintainability or other measures of good coding is just foolish.
Cyclomatic complexity is analogous to temperature. They are both measurements, and in most cases meaningless without context. If I said the temperature outside was 72 degrees, that doesn't mean much; but if I added the fact that I was at the North Pole, the number 72 becomes significant. If someone told me a method has a cyclomatic complexity of 10, I can't determine if that is good or bad without its context.
When I code review an existing application, I find cyclomatic complexity a useful “starting point” metric. The first thing I check for are methods with a CC > 10. These “>10” methods are not necessarily bad. They just provide me a starting point for reviewing the code.
General rules when considering a CC number:
The relationship between CC # and # of tests, should be CC# <= #tests
Refactor for CC# only if it increases maintainability
CC above 10 often indicates one or more Code Smells
[Off topic] If you favor readability over a good score in the metrics (was it J. Spolsky who said, "what's measured gets done"? - meaning that metrics are abused more often than not, I suppose), it is often better to use a well-named boolean to replace your complex conditional statement.
then
if (condition1 && condition2 && condition3)
Console.WriteLine("wibble");
becomes
bool/boolean theWeatherIsFine = condition1 && condition2 && condition3;
if (theWeatherIsFine)
Console.WriteLine("wibble");
I'm no expert at this subject, but I thought I would give my two cents. And maybe that's all this is worth.
Cyclomatic Complexity seems to be just a particular automated shortcut to finding potentially (but not definitely) problematic code snippets. But isn't the real problem to be solved one of testing? How many test cases does the code require? If CC is higher, but number of test cases is the same and code is cleaner, don't worry about CC.
1.) There is no decision point there. There is one and only one path through the program there, only one possible result with either of the two versions. The first is more concise and better, Cyclomatic Complexity be damned.
1 test case for both
2.) In both cases, you either write "wibble" or you don't.
2 test cases for both
3.) First one could result in nothing, "one", or "one" and "one and two". 3 paths. 2nd one could result in nothing, either of the two, or both of them. 4 paths.
3 test cases for the first
4 test cases for the second