I am in a compilers class and we are tasked with creating our own language, from scratch. Currently our dilemma is whether to include a 'null' type or not. What purpose does null provide? Some of our team is arguing that it is not strictly necessary, while others are pro-null just for the extra flexibility it can provide.
Do you have any thoughts, especially for or against null?
Have you ever created functionality that required null?
Null: The Billion Dollar Mistake. Tony Hoare:
I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years. In recent years, a number of program analysers like PREfix and PREfast in Microsoft have been used to check references, and give warnings if there is a risk they may be non-null. More recent programming languages like Spec# have introduced declarations for non-null references. This is the solution, which I rejected in 1965.
null is a sentinel value that is not an integer, not a string, not a boolean - not anything really, except something to hold and be a "not there" value. Don't treat it as or expect it to be a 0, or an empty string or an empty list. Those are all genuinely valid values in many circumstances - the idea of a null instead means there is no value there.
Perhaps it's a little bit like a function throwing an exception instead of returning a value. Except instead of manufacturing and returning an ordinary value with a special meaning, it returns a special value that already has a special meaning. If a language expects you to work with null, then you can't really ignore it.
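To make that concrete, here's a small C++ sketch using std::optional (the variable names are made up for illustration):

#include <optional>
#include <string>

// "" is a genuine value (a string of length zero), while std::nullopt
// means there is no value at all; the type keeps the two apart.
std::optional<std::string> middleName;          // no value there
std::optional<std::string> nickname = "";       // a real value: the empty string

bool hasMiddle = middleName.has_value();        // false
bool hasNick   = nickname.has_value();          // true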
Oh no, I feel the philosophy major coming out of me....
The notion of NULL comes from the notion of the empty set in set theory. Nearly everyone agrees that the empty set is not equal to zero. Mathematicians and philosophers have been battling about the value of set theory for decades.
In programming languages, I think it is very helpful to understand object references that do not refer to anything in memory. Google about set theory and you will see similarities between the formal symbolic systems (notation) that set theorists use and symbols we use in many computer languages.
Regards,
Sam
What's null for you ask?
Well,
Nothing.
I usually think of 'null' in the C/C++ aspect of 'memory address 0'. It's not strictly needed, but if it didn't exist, then people would just use something else (if myNumber == -1, or if myString == "").
All I know is, I can't think of a day I've spent coding that I haven't typed the word "null", so I think that makes it pretty important.
In the .NET world, MS recently added nullable types for int, long, etc. that never used to be nullable, so I guess they think it's pretty important too.
If I was designing a language, I would keep it. However I wouldn't avoid using a language that didn't have null either. It would just take a little getting used to.
The concept of null is not strictly necessary, in exactly the same sense that the concept of zero is not strictly necessary.
I don't think it's helpful to talk about null outside the context of the whole language design. First point of confusion: is the null type empty, or does it include a single, distinguished value (often called "nil")? A completely empty type is not very useful---although C uses the empty return type void to mark a procedure that is executed only for side effect, many other languages use a singleton type (usually the empty tuple) for this purpose.
I find that a nil value is used most effectively in dynamically typed languages. In Smalltalk it is the value used when you need a value but you don't have any information. In Lua it is used even more effectively: the nil value is the only value that cannot be a key or a value in a Lua table. In Lua, nil is also used as the value of missing parameters or results.
Overall I would say that a nil value can be useful in a dynamically typed setting, but in a statically typed setting, a null type is useful only for talking about functions (or procedures or methods) that are executed for side effect.
At all costs, avoid the NULL pointer used in C and Java. These are artifacts inherent in the implementations of pointers and objects, and in a well designed language they should not be allowed. By all means give your users a way to extend an existing type with a null value, but make them do it explicitly, on purpose; don't force every type to have one by accident. (As an example of explicit use, I recently implemented Bentley and Sedgewick's ternary search trees in Haskell, and I needed to extend the character type with one additional value meaning 'not a character'. For this purpose Haskell provides the Maybe type.)
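In C++ terms, the nearest analogue of Maybe is std::optional. A sketch of the same idea (extending a type explicitly, only where needed), with a made-up function name:

#include <optional>
#include <string>

// char extended with exactly one extra value meaning 'not a character',
// opted into explicitly at this use site; no other type is affected.
std::optional<char> firstChar(const std::string& s) {
    if (s.empty()) return std::nullopt;   // the 'not a character' case
    return s[0];
}

// Callers are forced to handle the missing case:
//     if (auto c = firstChar(word)) { /* use *c */ }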
Finally, if you are writing a compiler, it is good to remember that the easiest parts of the language to compile, and the parts that cause the fewest bugs, are the parts that aren't there :-)
It seems useful to have a way to indicate a reference or pointer that isn't currently pointing at anything, whether you call it null, nil, None, etc. If for no other reason than to let people know when they're about to fall off the end of a linked list.
In C, NULL was typically defined as ((void *)0), so it carried a pointer type as well as a value. But that didn't carry over to C++, whose stricter type rules don't let a void * convert implicitly to other pointer types, so C++ made NULL plain 0: it dropped the type and became a pure value.

However it was found that having a distinctly typed null would be better, so the C++ committee decided that C++0x will get nullptr, a null-pointer literal with its own unique type (std::nullptr_t).

Also, almost every language besides C++ has null as a distinct type, or an equivalent unique value that is not the same as 0 (it might compare equal to it or not, but it's not the same value).

So now even C++ will have a typed null, basically closing the discussion on the matter, since now (almost) everyone will have a null type.
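To see why the typed null matters, here's a small C++ sketch of the classic overload problem (assuming a C++11 compiler):

#include <iostream>

void f(int)   { std::cout << "f(int)\n"; }
void f(char*) { std::cout << "f(char*)\n"; }

int main() {
    f(0);        // picks f(int): 0 is an integer first
    // f(NULL);  // NULL is an integer constant, so this picks f(int)
    //           // or is ambiguous, depending on how NULL is defined
    f(nullptr);  // nullptr has its own type (std::nullptr_t): f(char*)
}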
Edit: Thinking about it, Haskell's Maybe is another solution to NULL types, but it's not as easy to grasp or implement.
A practical example of null is when you ask a yes/no question and don't get a response. You don't want to default to no because it might be important to know that the question wasn't answered in situations where the answer is very important.
Null is not a mistake.
Null means "I don't know yet"
For primitives you don't really need a null (I have to say that strings (in .NET) shouldn't get it IMHO)
But for composite entities it definitely serves a purpose.
You can think of any type as a set along with a collection of operations. There are many cases where it's convenient to have a value which isn't a "normal" value; for example, consider an "EOF" value for C's getline(). You can handle that in one of several ways: you can have a NULL value outside the set, you can distinguish a particular value as null (in C, ((void *)0) can serve that purpose), or you can have a way of creating a new type, so that for type T you create a type T' =def { T ∪ NULL }, which is the way Haskell does it (a "Maybe" type).
Which one is better makes for lots of enjoyable argument.
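The "value outside the set" option is exactly what C's character I/O does; for example, with getchar(), a close cousin of getline():

#include <cstdio>

int main() {
    // getchar() returns int, not char: the wider type leaves room for
    // EOF, a sentinel that no valid character can collide with.
    int c;
    while ((c = std::getchar()) != EOF)
        std::putchar(c);
    return 0;
}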
Null is only useful in situations where there are variables with unassigned values. If every variable has a value, then there is no need for null values.
Null is a sentinel value. It's a value that cannot possibly be real data and instead provides meta-data about the variable in use.
Null assigned to a pointer indicates that the pointer is uninitialized. This gives you the ability to detect misuse of uninitialized pointers by detecting dereferences of null valued pointers. If you instead leave the value of a pointer equal to whatever happened to be in memory then you would have crazily irregular program behavior that would be much more difficult to debug.
Also, the null character in a C-style variable length string is used to mark the end of the string.
The use of null in these ways, especially for pointer values, has become so popular that the metaphor has been imported into other systems, even when the "null" sentinel value is implemented entirely differently and has no relation to the number 0.
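A minimal C++ illustration of the pointer case (assuming a C++11 compiler for nullptr):

#include <iostream>

int main() {
    // Initializing the pointer to null makes "points at nothing yet"
    // detectable, instead of leaving whatever garbage was in memory.
    int* head = nullptr;

    if (head == nullptr)
        std::cout << "empty list - nothing to dereference\n";
    else
        std::cout << *head << "\n";   // only reached when head is valid
}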
Null is not the problem - everyone treating and interpreting null differently is the problem.
I like null. If there was no null, null would only be replaced with some other way for the code to say "I have no clue, dude!" (which some would write "I have no clue, man!", or "I have not a clue, old bean!" etc. and so, we'd have the exact same problems again).
I generalize, I know.
Consider the examples of C and of Java, for example. In C, the convention is that a null pointer is the numeric value zero. Of course, that's really just a convention: nothing about the language treats that value as anything special. In Java, however, null is a distinct concept that you can detect and know that, yes, this is in fact a bad reference and I shouldn't try to open that door to see what's on the other side.
Even so, I hate nulls almost worse than anything else.
CLARIFICATION based on comments: I hate the de facto null pointer value of zero worse than I hate null.
Any time I see an assignment to null, I think, "oh good, someone has just put a landmine in the code. Someday, we're going to be walking down a related execution path and BOOM! NullPointerException!"
What I would prefer is for someone to specify a useful default or NullObject that lets me know that "this parameter has not been set to anything useful." A bald null by itself is just trouble waiting to happen.
That said, it's still better than a raw zero wandering around loose.
That decision depends on the objective of the programming language.

Who are you designing the programming language for? Are you designing it for people who are familiar with C-derived languages? If so, then you should probably add support for null.
In general, I would say that you should avoid violating people's expectations unless it serves a particular purpose.
Take switch-blocks in C# as an example. All case labels in C# must have an explicit control-flow statement at the end of every branch. That is, they must all end with either a "break" statement or an explicit goto. That means that while this code is legal:
switch (x)
{
    case 1:
    case 2:
        foo();
        break;
}
this code would not be legal:
switch (x)
{
    case 1:
        foo();
    case 2:
        bar();
        break;
}
In order to create a "fall through" from case 1 to case 2, it's necessary to insert a goto, like this:
switch (x)
{
    case 1:
        foo();
        goto case 2;
    case 2:
        bar();
        break;
}
This is arguably something that would violate the expectations of C++ programmers who are learning C#. However, adding that restriction serves a purpose. It eliminates the possibility of an entire class of common C++ bugs. It adds to the learning curve of the language slightly, but the result is a net benefit to the programmer.
If your goal is to design a language targeted at C++ programmers, then removing null would probably violate their expectations. That will cause confusion, and make your language more difficult to learn. The key question is then, "what benefit do they get"? Or, alternatively, "what detriment does this cause".
If you are simply trying to design a "super small language" that can be implemented in the course of a single semester, then the story is different. In that case your objective isn't to build a useful language targeted at a particular segment of the population. Instead, it's just to learn how to create a compiler. In that scenario, having a smaller language is a big benefit, and so it's worth eliminating null.
So, to recap, I would say that you should:
Identify your goals in creating the language. Who is the language designed for, and what are their needs?
Make the decision based on what helps the target users meet their goals in the best way.
Usually this will make the desired result pretty clear.
Of course, if you don't explicitly articulate your design goals, or you can't agree on what they are, then you are still going to argue. In that case, however, you are pretty much doomed anyways.
One other way to look at null is that it's a performance issue. If you have a complex object containing other complex objects and so on, then it is more efficient to allow all properties to initially be null instead of creating some kind of empty objects that are good for nothing and will soon be replaced.
That's just one perspective that I can't see mentioned before.
What purpose does null provide?
I believe there are two concepts of null at work here.
The first (null the logical indicator) is a conventional program language mechanism that provides runtime indication of a non-initialized memory reference in program logic.
The second (null the value) is a base data value that can be used in logical expressions to detect the logical null indicator (the previous definition) and make logical decisions in program code.
Do you have any thoughts, especially for or against null?
While null has been the bane of many programmers and the source of many application faults over the years, the null concept has validity. If you and your team create a language that uses memory references that can be potentially misused because the reference was not initialized, you will likely need a mechanism to detect that eventuality. It is always an option to create an alternative, but null is a widely known alternative.
Bottom line, it all depends upon the goals of your language:
target programming audience
robustness
performance
etc...
If robustness and program correctness are high on your priority list AND you allow programmatic memory references, you will want to consider null.
BB
If you are creating a statically typed language, I imagine that null could add a good deal of complexity to your compiler.
If you are creating a dynamically typed language, NULL can come in quite handy, as it is just another "type" without any variations.
Null is a placeholder that means that no value (append "of the correct type" for a static-typed language) can be assigned to that variable.
There is cognitive dissonance here. I heard somewhere else that humans cannot comprehend negation, because they must suppose a value and then imagine its unfitness.
My suggestion to your team is: come up with some examples programs that need to be written in your language, and see how they would look if you left out null, versus if you included it.
Use a null object pattern!
If your language is object oriented, let it have an UndefinedValue class of which only one singleton instance exists. Then use this instance wherever null is used. This has the advantage that your null will respond to messages such as #toString and #equals. You will never run into a null pointer exception as in Java. (Of course, this requires that your language is dynamically typed.)
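Even in a statically typed language you can get most of the effect through dynamic dispatch; here's a rough C++ sketch with made-up class names:

#include <iostream>
#include <string>

// The Null Object pattern: UndefinedValue is a real object that safely
// answers the same messages as an ordinary value, instead of crashing.
struct Value {
    virtual std::string toString() const = 0;
    virtual bool isUndefined() const { return false; }
    virtual ~Value() {}
};

struct IntValue : Value {
    int n;
    explicit IntValue(int n) : n(n) {}
    std::string toString() const override { return std::to_string(n); }
};

struct UndefinedValue : Value {
    std::string toString() const override { return "undefined"; }
    bool isUndefined() const override { return true; }
    // The one shared instance, used wherever null would otherwise appear.
    static const UndefinedValue& instance() {
        static UndefinedValue u;
        return u;
    }
private:
    UndefinedValue() {}
};

int main() {
    const Value& v = UndefinedValue::instance();
    std::cout << v.toString() << "\n";   // prints "undefined", no null crash
}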
Null provides an easy way out for programmers who haven't completely thought through the logic and domains needed by their program, or the future maintenance implications of using a value with essentially no clear and agreed upon definition.
It may seem obvious at first that it must mean "no value", but what that ACTUALLY means depends on context. If, for instance, lastName === null, does that mean that person doesn't have a last name, or that we don't know what their last name is, or that it hasn't been entered into the system yet? Does null equal itself, or doesn't it? In SQL it does not. In many languages it does. But if we don't know the value of personA.lastName, or personB.lastName, how can we know that personA.lastName === personB.lastName, eh? Should the result be false, or... null?

It depends on what you're doing, which is why it's dangerous and silly to have some kind of system-wide value that can be used for any kind of situation that sort of looks like "nothing": other parts of your program and external libraries or modules can't really be depended upon to correctly interpret what you meant by "null".

You're much better off clearly defining the DOMAIN of possible values of lastName, and exactly what every possible value actually means, rather than depending on some vague system-wide notion of null, which may or may not have any relevance to what you're doing, depending on which language you're using and what you're trying to do - a value which may, in fact, behave in exactly the wrong way when you begin to operate on your data.
Null is to objects what 0 is to numbers.
Related
Last night I was thinking that programming languages could have a feature whereby we would be able to constrain the values assigned to primitive data types.

For example, I should be able to say that my variable of type int can only hold values between 0 and 100:
int<0, 100> progress;
This would then act as a normal integer in all scenarios except the fact that you won't be able to assign values outside the range defined in the constraint. The compiler will not compile the code progress = 200.
This constraint can be carried over with type information.
Is this possible? Is it done in any programming language? If yes then which language has it and what is this technique called?
It is generally not possible. It makes little sense to use integers without any arithmetic operators. With arithmetic operators you have this:
int<0,100> x, u, v;
...
x = u + v; // is it in range?
If you're willing to do checks at run-time, then yes, several mainstream languages support it, starting with Pascal.
I believe Pascal (and Delphi) offers something similar with subrange types.
I think this is not possible at all in Java and in Ruby (well, in Ruby probably it is possible, but requires some effort). I have no idea about other languages, though.
Ada allows something like what you describe with ranges:
type My_Int is range 1..100;
So if you try to assign a value to a My_Int that's less than 1 or greater than 100, Ada will raise the exception Constraint_Error.
Note that I've never used Ada. I've only read about this feature, so do your research before you plunge in.
It is certainly possible. There are many different techniques to do that, but 'dependent types' is the most popular.
The constraints can even be checked statically at compile time by the compiler. See, for example, Agda2 and ATS (ats-lang.org).
Weaker forms of your 'range types' are possible without full dependent types, I think.
Some keywords to search for research papers:
- Guarded types
- Refinement types
- Subrange types
Certainly! In case you missed it: C. Do you C? You don't C? You don't count short as a constraint on Integer? Ok, so C only gives you pre-packaged constrained types.
BTW: It seems the answer that Pascal has subrange types misses the point of them. In Pascal, array bounds violations are not possible. This is because the array index must be of the same type as the array was declared with. In turn this means that to use an integer index you must coerce it down to the subrange, and that is where the run-time check is done, not when accessing the array.
This is a very important idea because it means a for loop over an array index type may access the array components safely without any run time checking.
Pascal has subranges. Ada extended that a bit, so you can do something like a subrange, or you can create an entirely new type with characteristics of the existing type, but not compatible with it (e.g., even if it was in the right range, you wouldn't be able to assign an Integer to your new type based off of Integer).
C++ doesn't support the idea directly, but is flexible enough that you can implement it if you want to. If you decide to support all the compound assignment operators (+=, -=, *=, etc.) this can be a lot of work though.
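For illustration, a minimal run-time-checked sketch of such a C++ type (all names hypothetical):

#include <stdexcept>

// A minimal run-time-checked ranged integer, roughly what int<0,100>
// could desugar to; bounds are template parameters, violations throw.
template <int Lo, int Hi>
class Ranged {
public:
    Ranged(int x) : v(check(x)) {}
    Ranged& operator=(int x) { v = check(x); return *this; }
    Ranged& operator+=(Ranged o) { v = check(v + o.v); return *this; }
    operator int() const { return v; }
private:
    static int check(int x) {
        if (x < Lo || x > Hi) throw std::out_of_range("value out of range");
        return x;
    }
    int v;
};

int main() {
    Ranged<0, 100> progress = 50;
    progress += 40;        // fine: 90 is still in range
    // progress = 200;     // would throw std::out_of_range at run time
    int plain = progress;  // implicit conversion back to int
    return plain == 90 ? 0 : 1;
}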
Other languages that support operator overloading (e.g., ML and company) can probably support it in much the same way as C++.
Also note that there are a few non-trivial decisions involved in the design. In particular, if the type is used in a way that could/does result in an intermediate result that overflows the specified range, but produces a final result that's within the specified range, what do you want to happen? Depending on your situation, that might be an error, or it might be entirely acceptable, and you'll have to decide which.
I really doubt that you can do that. After all, these are primitive datatypes, with emphasis on primitive!

Adding a constraint will make the type a subclass of its primitive state, thus extending it.

From Wikipedia:
a basic type is a data type provided by a programming language as a basic building block. Most languages allow more complicated composite types to be recursively constructed starting from basic types.
a built-in type is a data type for which the programming language provides built-in support.
So personally, even if it is possible I wouldn't do it, since it's bad practice. Instead just create an object that returns this type and enforces the constraints (which I am sure is a solution you already thought of).
SQL has domains, which consist of a base type together with a dynamically-checked constraint. For example, you might define the domain telephone_number as a character string with an appropriate number of digits, valid area code, etc.
Why would a language NOT use Short-circuit evaluation? Are there any benefits of not using it?
I see that it could lead to some performance issues... is that true? Why?
Related question : Benefits of using short-circuit evaluation
Reasons NOT to use short-circuit evaluation:
Because it will behave differently and produce different results if your functions, property Gets or operator methods have side-effects. And this may conflict with: A) Language Standards, B) previous versions of your language, or C) the default assumptions of your language's typical users. These are the reasons that VB has for not short-circuiting.
Because you may want the compiler to have the freedom to reorder and prune expressions, operators and sub-expressions as it sees fit, rather than in the order that the user typed them in. These are the reasons that SQL has for not short-circuiting (or at least not in the way that most developers coming to SQL think it would). Thus SQL (and some other languages) may short-circuit, but only if it decides to and not necessarily in the order that you implicitly specified.
I am assuming here that you are asking about "automatic, implicit order-specific short-circuiting", which is what most developers expect from C,C++,C#,Java, etc. Both VB and SQL have ways to explicitly force order-specific short-circuiting. However, usually when people ask this question it's a "Do What I Meant" question; that is, they mean "why doesn't it Do What I Want?", as in, automatically short-circuit in the order that I wrote it.
One benefit I can think of is that some operands might have side-effects that you expect to happen; short-circuit evaluation can silently skip them.
Example:
if (true || someBooleanFunctionWithSideEffect()) {
...
}
But that's typically frowned upon.
Ada does not do it by default. In order to force short-circuit evaluation, you have to use "and then" or "or else" instead of "and" or "or".
The issue is that there are some circumstances where it actually slows things down. If the second condition is quick to calculate and the first condition is almost always true for "and" or false for "or", then the extra check-branch instruction is kind of a waste. However, I understand that with modern processors with branch predictors, this isn't so much the case. Another issue is that the compiler may happen to know that the second half is cheaper or likely to fail, and may want to reorder the check accordingly (which it couldn't do if short-circuit behavior is defined).
I've heard objections that it can lead to unexpected behavior of the code in the case where the second test has side effects. IMHO it is only "unexpected" if you don't know your language very well, but some will argue this.
In case you are interested in what actual language designers have to say about this issue, here's an excerpt from the Ada 83 (original language) Rationale:
The operands of a boolean expression such as A and B can be evaluated in any order. Depending on the complexity of the term B, it may be more efficient (on some but not all machines) to evaluate B only when the term A has the value TRUE. This however is an optimization decision taken by the compiler and it would be incorrect to assume that this optimization is always done. In other situations we may want to express a conjunction of conditions where each condition should be evaluated (has meaning) only if the previous condition is satisfied. Both of these things may be done with short-circuit control forms ...

In Algol 60 one can achieve the effect of short-circuit evaluation only by use of conditional expressions, since complete evaluation is performed otherwise. This often leads to constructs that are tedious to follow...

Several languages do not define how boolean conditions are to be evaluated. As a consequence programs based on short-circuit evaluation will not be portable. This clearly illustrates the need to separate boolean operators from short-circuit control forms.
Look at my example at On SQL Server boolean operator short-circuit, which shows why a certain access path in SQL is more efficient if boolean short-circuit is not used. My blog example shows how relying on boolean short-circuit can break your code if you assume short-circuiting in SQL, but if you read the reasoning why SQL evaluates the right-hand side first, you'll see that it is correct and results in a much improved access path.
Bill has alluded to a valid reason not to use short-circuiting, but to spell it out in more detail: highly parallel architectures sometimes have problems with branching control paths.
Take NVIDIA’s CUDA architecture for example. The graphics chips use an SIMT architecture which means that the same code is executed on many parallel threads. However, this only works if all threads take the same conditional branch every time. If different threads take different code paths, evaluation is serialized – which means that the advantage of parallelization is lost, because some of the threads have to wait while others execute the alternative code branch.
Short-circuiting actually involves branching the code so short-circuit operations may be harmful on SIMT architectures like CUDA.
But like Bill said, that’s a hardware consideration. As far as languages go, I’d answer your question with a resounding no: preventing short-circuiting does not make sense.
I'd say 99 times out of 100 I would prefer the short-circuiting operators for performance.
But there are two big reasons I've found where I won't use them.
(By the way, my examples are in C where && and || are short-circuiting and & and | are not.)
1.) When you want to call two or more functions in an if statement regardless of the value returned by the first.
if (isABC() || isXYZ()) // short-circuiting logical operator
//do stuff;
In that case isXYZ() is only called if isABC() returns false. But you may want isXYZ() to be called no matter what.
So instead you do this:
if (isABC() | isXYZ()) // non-short-circuiting bitwise operator
//do stuff;
2.) When you're performing boolean math with integers.
myNumber = i && 8; // short-circuiting logical operator
is not necessarily the same as:
myNumber = i & 8; // non-short-circuiting bitwise operator
In this situation you can actually get different results because the short-circuiting operator won't necessarily evaluate the entire expression. And that makes it basically useless for boolean math. So in this case I'd use the non-short-circuiting (bitwise) operators instead.
Like I was hinting at, these two scenarios really are rare for me. But you can see there are real programming reasons for both types of operators. And luckily most of the popular languages today have both. Even VB.NET has the AndAlso and OrElse short-circuiting operators. If a language today doesn't have both I'd say it's behind the times and really limits the programmer.
If you wanted the right hand side to be evaluated:
if( x < 13 | ++y > 10 )
printf("do something\n");
Perhaps you wanted y to be incremented whether or not x < 13. A good argument against doing this, however, is that creating conditions without side effects is usually better programming practice.
As a stretch:
If you wanted a language to be super secure (at the cost of awesomeness), you would remove short circuit eval. When something 'secure' takes a variable amount of time to happen, a Timing Attack could be used to mess with it. Short circuit eval results in things taking different times to execute, hence poking the hole for the attack. In this case, not even allowing short circuit eval would hopefully help write more secure algorithms (wrt timing attacks anyway).
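For example, cryptographic code routinely replaces a short-circuiting comparison with a fixed-time one; a C++ sketch of the idea:

#include <cstddef>

// Compare two equal-length secrets without early exit: the loop always
// runs to the end, so timing does not reveal where the first mismatch is.
bool constantTimeEqual(const unsigned char* a, const unsigned char* b,
                       std::size_t n) {
    unsigned char diff = 0;
    for (std::size_t i = 0; i < n; ++i)
        diff |= static_cast<unsigned char>(a[i] ^ b[i]);  // note: |=, never a break
    return diff == 0;
}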
The Ada programming language supported both boolean operators that did not short circuit (AND, OR), to allow a compiler to optimize and possibly parallelize the constructs, and operators with explicit request for short circuit (AND THEN, OR ELSE) when that's what the programmer desires. The downside to such a dual-pronged approach is to make the language a bit more complex (1000 design decisions taken in the same "let's do both!" vein will make a programming language a LOT more complex overall;-).
Not that I think this is what's going on in any language now, but it would be rather interesting to feed both sides of an operation to different threads. Most operands could be pre-determined not to interfere with each other, so they would be good candidates for handing off to different CPUs.
This kind of thing matters on highly parallel CPUs that tend to evaluate multiple branches and choose one.
Hey, it's a bit of a stretch but you asked "Why would a language"... not "Why does a language".
The language Lustre does not use short-circuit evaluation. In if-then-elses, both then and else branches are evaluated at each tick, and one is considered the result of the conditional depending on the evaluation of the condition.
The reason is that this language, and other synchronous dataflow languages, have a concise syntax to speak of the past. Each branch needs to be computed so that the past of each is available if it becomes necessary in future cycles. The language is supposed to be functional, so that wouldn't matter, but you may call C functions from it (and perhaps notice they are called more often than you thought).
In Lustre, writing the equivalent of
if (y <> 0) then 100/y else 100
is a typical beginner mistake. The division by zero is not avoided, because the expression 100/y is evaluated even on cycles when y=0.
Because short-circuiting can change the behavior of an application. For example:
if(!SomeMethodThatChangesState() || !SomeOtherMethodThatChangesState())
I'd say it's valid for readability issues; if someone takes advantage of short circuit evaluation in a not fully obvious way, it can be hard for a maintainer to look at the same code and understand the logic.
If memory serves, Erlang provides two sets of constructs: the standard and/or, plus andalso/orelse. The latter clarify intent, as in "yes, I know this is short-circuiting, and you should too", whereas otherwise the intent needs to be derived from the code.
As an example, say a maintainer comes across these lines:
if(user.inDatabase() || user.insertInDatabase())
user.DoCoolStuff();
It takes a few seconds to recognize that the intent is "if the user isn't in the Database, insert him/her/it; if that works do cool stuff".
As others have pointed out, this is really only relevant when doing things with side effects.
I don't know about any performance issues, but one possible argumentation to avoid it (or at least excessive use of it) is that it may confuse other developers.
There are already great responses about the side-effect issue, but I didn't see anything about the performance aspect of the question.
If you do not allow short-circuit evaluation, the performance issue is that both sides must be evaluated even though it will not change the outcome. This is usually a non-issue, but may become relevant under one of these two circumstances:
The code is in an inner loop that is called very frequently
There is a high cost associated with evaluating the expressions (perhaps IO or an expensive computation)
The short-circuit evaluation automatically provides conditional evaluation of a part of the expression.
The main advantage is that it simplifies the expression.
The performance could be improved but you could also observe a penalty for very simple expressions.
Another consequence is that side effects of the evaluation of the expression could be affected.
In general, relying on side-effect is not a good practice, but in some specific context, it could be the preferred solution.
VB6 doesn't use short-circuit evaluation, I don't know if newer versions do, but I doubt it. I believe this is just because older versions didn't either, and because most of the people who used VB6 wouldn't expect that to happen, and it would lead to confusion.
This is just one of the things that made it extremely hard for me to get out of being a noob VB programmer who wrote spaghetti code, and get on with my journey to be a real programmer.
Many answers have talked about side-effects. Here's a Python example without side-effects in which (in my opinion) short-circuiting improves readability.
for i in range(len(myarray)):
    if myarray[i] > 5 or (i > 0 and myarray[i-1] > 5):
        print "At index", i, "either arr[i] or arr[i-1] is big"
The short-circuit ensures we don't try to access myarray[-1], which would raise an exception since Python arrays start at 0. The code could of course be written without short-circuits, e.g.
for i in range(len(myarray)):
    if myarray[i] > 5:
        print "At index", i, ...
    elif i > 0:
        if myarray[i-1] > 5:
            print "At index", i, ...
but I think the short-circuit version is more readable.
I will expand here on a comment I made to When a method has too many parameters? where the OP was having minor problems with someone else's function which had 97 parameters.
I am a great believer in writing maintainable code (and it is often easier to write than to read, hence Steve McConnell (praise be upon his name)'s phrase "write only code").

Since statistics show that most car accidents happen at junctions, and my experience (ymmv) shows that most "anomalies" occur at interfaces, I will list some things that I do to attempt to avoid misunderstandings at interfaces, and invite your comments if I am going badly wrong.
But, more importantly, I invite your suggestions for making things even more prophylactic (see, there is a question after all - how to improve things?).
Adequate documentation, in the form of (up to date) DoxyGen format comments describing the nature and purpose of each parameter.
absolutely NO back-door shenanigans with global variables as hidden parameters.
try to limit parameters to six or eight. If more, pass related parameters as a structure (there's a sketch of this further down); if they are not related then seriously reconsider the function. If it needs so much information, is it too complex to maintain? Can it be broken down into several smaller functions?
use CONST as often as possible and meaningful.
a coding standard that says that input parameters come first, then output only, and finally input/output, which are modified by the function.
I also #define some empty macros to make declarations even easier to read:
#define INPUT
#define OUTPUT
#define MODIFY
bool DoSomething(INPUT int howOften, MODIFY Widget *myWidget, OUTPUT WidgetPtr * const nextWidget)
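And to illustrate the "related parameters as a structure" point from the list above, a minimal sketch (all names hypothetical):

#include <string>

// Grouping related parameters into one structure: the call site then
// names every field, so arguments cannot be silently swapped the way
// long positional parameter lists allow.
struct PrintJobOptions {
    int         copies;
    bool        duplex;
    bool        collate;
    std::string paperSize;
};

bool PrintDocument(const std::string& path, const PrintJobOptions& options) {
    // (stub) a real implementation would drive the printer here
    return !path.empty() && options.copies > 0;
}

// Usage, with C++20 designated initializers:
//     PrintDocument("report.pdf", {.copies = 2, .duplex = true,
//                                  .collate = true, .paperSize = "A4"});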
Just a few ideas. How can I improve on these? Thanks.
Addressing your points in order:
Well-designed types usually render Doxygen format comments a waste of time.
While true as stated ("shenanigans" are bad by definition), not all use of globals is really as bad as many people imply. If you have to pass a parameter more than about four times before it's really used, chances are that a global will be less error prone.
Eight or even six parameters is usually excessive. Any more than two or three starts to indicate that the function is doing more than one thing. One obvious exception is a constructor that aggregates a number of other items into an object (e.g. an address object that takes a street name, number, city, country, postal code, etc., as inputs).
Better stated as "write const-correct code."
Given C++'s default parameter capability, it's generally best to sort in ascending order of likelihood to use a default value.
Don't. Just don't! If it's not obvious what are inputs and what are outputs, that pretty much proves that the basic design is fatally flawed.
As for ideas I think are actually good:
As implied in the first point, concentrate on types. Once you get them right, most of the other problems just disappear.
Use a few (even just one) central theme(s). For Lisp, everything is a list. For Unix, everything is a file (and files are all simple streams of bytes). Emulate this simplicity.
Edit: replying to comments:
While you do have something of a point, my experience still indicates that documentation produced with Doxygen (and similar such as javadoc) is almost universally useless. In theory the tool doesn't prevent decent documentation, but in fact it's rare at best.
Globals certainly can cause problems -- but I'm old enough to have used Fortran back before it provided much alternative, and with some care it really wasn't nearly as bad as many people imply. A lot of the stories seem to be at least third hand, with a bit of extra "spice" added each time they're re-told. I've seen one story that sounds a lot like an exaggerated version of one I told a couple decades ago or so...
Hm...Markdown formatting doesn't seem to approve of my skipping numbers.
And again...
My comment was specific to C++, but quite a few other languages also support default parameters and/or overloading, and it can apply about as well to most of them. Even without it, a call like f(param1, param2, 0,0,0); is pretty easy to see as having default parameters. To an extent, ordering by usage is handy, but when you do the order you pick doesn't matter nearly as much as simply being consistent.
True, a void * parameter doesn't tell you much -- but a MODIFY void * is little better. A real type and consistent use of const provides far more information and gets checked by the compiler. Other languages may not have/use const, but they probably don't have macros either. OTOH, some directly support what you want -- e.g., Ada has in, out and inout specifiers.
I am not sure we will end at a single point of agreement about this; everyone will come up with different ideas (good or bad in each other's perspective). Having said that, I find Code Complete to be a good place to go when I am stuck with this sort of problem.
A big peeve of mine is control coupling between functions. (Control coupling is when one module controls the execution flow of another, by passing flags telling the called function what to do.)
For example (cut & paste from code I just had to work on):
void UartEnable(bool enable, int baud);
as opposed to:
void UartEnable(int baud);
void UartDisable(void);
Put another way -- parameters are for passing "data", not "control".
I'd use the 'rule' put forward by Uncle Bob in his book Clean Code.
These are the ones I think I remember:
2 parameters are ok, 3 are bad, more need refactoring
Comments are a sign of bad names. So there should be none, and the purpose of the function and the parameters should be clear from the names
make the method short. Aim for below 10 lines of code.
A rough definition of a data structure is that it allows you to store data and apply a set of operations on that data while preserving consistency of data before and after the operation.
However some people insist that a primitive variable like 'int' can also be considered a data structure. I get the part where it allows you to store data, but I guess the operation part is missing. Primitive variables don't have operations attached to them. So I feel that unless you have a set of operations defined and attached to it, you cannot call it a data structure. 'int' doesn't have any operations attached to it; it can only be operated upon with a set of generic operators.
Please advise if I got something wrong here.
To say that something is structured implies that there is a form or formatting that defines HOW the data is structured. Note this has nothing to do with how the data is actually stored. You could for example create a data structure that exists entirely within a single Integer, yet represents a number of different values.
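For instance, here's a C++ sketch of a "structure" that exists entirely within a single integer (the field layout is just an example):

#include <cassert>
#include <cstdint>

// A "structure" living entirely inside one 32-bit integer:
// bits 0-4 hold the day, bits 5-8 the month, bits 9 and up the year.
std::uint32_t packDate(unsigned year, unsigned month, unsigned day) {
    return (year << 9) | (month << 5) | day;
}
unsigned dayOf(std::uint32_t d)   { return d & 0x1F; }
unsigned monthOf(std::uint32_t d) { return (d >> 5) & 0x0F; }
unsigned yearOf(std::uint32_t d)  { return d >> 9; }

int main() {
    std::uint32_t d = packDate(2009, 12, 31);
    assert(yearOf(d) == 2009 && monthOf(d) == 12 && dayOf(d) == 31);
}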
A data structure is an arbitrary construct used to describe how to store data in a system. It may be as simple as a single primitive, or as complex as a class. So the answer is largely subjective. It's "yes" if you decide to use a primitive as such - a simple primitive may be considered a primitive data structure, because it describes HOW you wish to store an element of data. The answer is also "no", because it describes an element of a structure and not necessarily the whole structure in itself.
As for how this relates to operations, strictly speaking a data structure has nothing to do with behaviour, it is simply a storage mechanism. Preserving consistency of data is really a behavioural thing. Yes, your compiler probably spits out errors if you try to shoe-horn a 32-bit value into a Byte, but that's symptomatic of the behaviour of the system (Ie: compilation) acting on the data structure of your application, of which your primitives are an element.
I don't think your definition of data structure is correct.
It seems to me that a struct (with no methods) is a valid data structure, but it has no real 'operations'. And that's not important. It's holding data.
To that end, an int holds data, and an Object holds data. They are data structures (technically).
That said, I don't ever find myself saying "What datastructure shall I use? I know! an int!".
I would say you need to re-evaluate the meaning of "data structure".
Your definition of a data structure isn't quite correct. A data structure doesn't necessarily have any attached behaviors or operations. An ADT or Abstract Data Type is what you are describing as a data structure. An ADT includes the data and the behaviors or operations that work on that data. An int by itself is not an ADT, but I suppose you could call it a data structure. If you encapsulate an int and its operations then you have an ADT which is what I think you are trying to describe as a data structure. Classes provide a mechanism for implementation of ADTs in modern languages.
Wikipedia has a good description of abstract data types.
I would argue that "int" is a data structure - it has a defined representation and meaning. That is, depending on your system, it has a specific length, a specific set of operators available to it, and a specified representation (be it two's complement or something else). It is designed to hold "integer numbers".
Practically, the distinction isn't particularly relevant.
Primitives do have operations attached to them; however, they may not be in the format of methods as you would expect in an object-oriented paradigm.
Assignment =, addition +, subtraction -, comparison ==, etc are all operations. Especially if you consider that you can explicitly define, override, or overload these operations for arbitrary classes (i.e.: data structures) in some languages (e.g. C++), then the primitive int, char, or what have you, are not very different.
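To illustrate in C++, with a made-up wrapper type:

#include <iostream>

// A trivial wrapper type that defines for itself the same operations an
// int gets "for free", showing that primitive operators are ordinary
// operations, just spelled as operators rather than methods.
struct Meters {
    double value;
    Meters operator+(Meters other) const { return {value + other.value}; }
    Meters operator-(Meters other) const { return {value - other.value}; }
    bool   operator==(Meters other) const { return value == other.value; }
};

int main() {
    Meters a{3.0}, b{4.0};
    Meters c = a + b;                 // same syntax as int arithmetic
    std::cout << c.value << "\n";     // prints 7
}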
Primitive variables don't have operations attached to them. So I feel that unless you have a set of operations defined and attached to it you cannot call it a data structure.
'int' doesn't have any operation attached to it, it can be operated upon with a set of generic operators.
Generic? Then how come 2+2 works, but "ninja" + List<float> doesn't? If the operator was generic, it'd work on anything. It doesn't. It only works on a few predefined types, such as integers.
Ints certainly have a set of operations defined on them. Arithmetic operations such as addition, subtraction, multiplication or division, for example. Most languages also have some kind of ToString()-like functionality defined on integers. But you can't do just anything with an int. For example, you can't pass an int to a function expecting a string. ints have a very specific set of operations defined on them. Those operations just aren't member methods. They come in the form of operators and non-member functions or member methods of other classes. But they are still operations that work on integers.
I don't think (ref: Wikipedia entry) that a data structure includes the definition of permissible operations (or operators). Of course, we could cite the C++ class as a counter-example, in which we can define overloaded operators. At the same time, we define a structure as just a composite/user-defined datatype and do not declare any permissible operations on it. We allow the compiler to figure that out.
'int' doesn't have any operation attached to it, it can be operated upon with a set of generic operators.
Operations are intrinsically linked with the things that they operate on; there's no such thing as generic operations.
This is true in the mathematical sense (< works in the set of integers, but has no meaning for complex numbers), and also in the computer scientific sense (evaluating a + b requires that a and b are or can be converted to compatible types on which the + operation is defined).
Of course, it depends what you mean by "data structure." Others have focused on whether your definition is correct and raise good questions. But what if we say, "Let's ignore the term for now, and focus on what you described"? In other words, what if we look at
A piece of data that has a designated interpretation of its value
A set of operations on that data
Then certainly, int qualifies. (If there were no operations on int, we'd all be stuck!)
For a more mathematical approach to programming that begins with these questions, and takes them to what some have called "an algebra of computation," see Elements of Programming by Alex Stepanov and Paul McJones.
I often find myself trying to come up with good names for complementary pairs of variables; where two variables denote opposing concepts, two participants in some sort of duologue, and so on.
This might be better explained by a counter-example - I maintain an app that prints two graphics as part of a print advertisement. They're stored in the database as TopLogo and LowerLogo, which I have to stop and double-check every time I use them because I'm expecting top to complement bottom, and lower should complement upper.
There's some obvious examples that I think work well:
client / server
source / target for copying/moving data or files from one variable to another
minimum / maximum
but there's some concepts that just don't lend themselves to such neat naming schemes. For example, when paging through records, does 'last' mean 'final' or 'previous'? I recently saw some code that used firstPage, previousPage, nextPage and finalPage to avoid the ambiguous lastPage completely, which I thought was very neat, hence this question.
Do you have any particularly neat variable name pairs you'd care to share with us? (Bonus points if they're the same length, which makes the code so much neater in monospaced fonts.)
Like with all kinds of code style conventions, consistency is what you should strive for.
I would have the development team agree on "standard" pairs of prefixes for common scenarios like "source/destination" or "from/to" and then stick with them for the whole project. As long as every developer is aware of what is meant with a particular prefix in the codebase, it is easier to avoid misunderstandings.
Exceptions to the rule should be clarified in the documentation if the variable is part of a public API, or in comments within the code if its visibility is restricted to a single class or method.
In my databases you'll find many valid-state temporal ("history") tables containing a pair of columns named start_date and end_date. No bonus points for me, then, because I'd rather use the commonly used 'end' than try to come up with an intuitive alternative with the same number of characters as the word 'start'.
I tend to prefer these generic terms even when more context-specific terms may be viable e.g. preferring employee_start_date over employee_hire_date (what if their employment started for a reason other than being formally hiring e.g. their company was the subject of an acquisition). That said, I'd prefer person_birth_date over person_start_date :)
While one does try to be semantically coherent in obvious cases -- e.g., maximum goes with minimum, and not "lowest" -- in well-structured OO code (which isn't all code, I know) the problem disappears with a good IDE. Classes are short, methods are short, and variables are few in each method. So it doesn't matter what you call the variable pairs so long as they're clear. Your code might not look professional, but real quality is in the code, not in the look of your code.
The problem further disappears if there is good JavaDoc or whatever the documentation system is, and if you have good Class names that go with them. For instance, if you have an instance of a Connection class and it has a method called setDestination, that's okay, but if you know that setDestination takes one parameter called destination and it's of the Server class, you're cool... even though you might prefer to call it target, aimHere, placeToSendTheData, or whatever (and the corresponding names, source, comingFromHere, and placeToGetTheDataFrom). Plus the doc system says what the thing is for, and that is priceless.
This next thing might sound stupid and I'm sure I'll get voted down here on StackOverflow, but unique non-professional sounding variable names have a great advantage: I know that my variables have names like placeWeWantTheDataToGo (and the IDE takes care of typing it), but the "serious" guys who do the JDK would never use such silly names. So I know immediately that the variable is one of mine. Incidentally, when I worked with developers in Spain and Italy, they write code with Spanish variable names (not always, but usually). This causes the same effect: we can quickly see that the Conexion class is ours, but the Connection class is not.
[Also, instead of typing your variable names, assign them a constant String somewhere in your code and use that, so if they called it lower or downer instead of low, you're still okay.]
Yes, I do try to name complementary sets of variables systematically so that the symmetry is clear. It is not always easy; sometimes, not even possible. Well, not possible using the rules I lay down for myself - which means I usually try to have the names the same length. The 'top' and 'lower' example would drive me batty (assuming I'm not batty already, which is far from certain); I'd probably use 'upper' and 'lower' because those are the same length; 'top' and 'bottom' would frustrate me too because of the difference in length.