Conditional logging with minimal cyclomatic complexity - language-agnostic

After reading "What’s your/a good limit for cyclomatic complexity?", I realized many of my colleagues were quite annoyed with a new QA policy on our project: no more than a cyclomatic complexity of 10 per function.
Meaning: no more than 10 'if', 'else', 'try', 'catch' and other code-workflow branching statements. Right. As I explained in 'Do you test private method?', such a policy has many good side effects.
But: at the beginning of our project (200 people, 7 years long), we were happily logging (and no, we cannot easily delegate that to some kind of aspect-oriented programming approach for logs):
myLogger.info("A String");
myLogger.fine("A more complicated String");
...
And when the first versions of our system went live, we experienced huge memory problems, not because of the logging (which was at one point turned off), but because of the log parameters (the strings), which were always computed and then passed to the 'info()' or 'fine()' functions, only to discover that the level of logging was 'OFF' and that no logging was taking place!
So QA came back and urged our programmers to do conditional logging. Always.
if (myLogger.isLoggable(Level.INFO)) { myLogger.info("A String"); }
if (myLogger.isLoggable(Level.FINE)) { myLogger.fine("A more complicated String"); }
...
But now, with that 'cannot-be-moved' limit of 10 cyclomatic complexity per function, they argue that the various log statements they put in their functions feel like a burden, because each "if(isLoggable())" counts as +1 cyclomatic complexity!
So if a function has 8 'if', 'else' and so on in one tightly coupled, not easily shareable algorithm, plus 3 critical log actions... it breaches the limit even though the conditional logs may not really be part of the complexity of that function...
How would you address this situation?
I have seen a couple of interesting coding evolutions (due to that 'conflict') in my project, but I just want to get your thoughts first.
Thank you for all the answers.
I must insist that the problem is not 'formatting' related, but 'argument evaluation' related (an evaluation that can be very costly to perform just before calling a method which will do nothing).
So when I wrote "A String" above, I actually meant aFunction(), with aFunction() returning a String and being a call to a complicated method collecting and computing all kinds of log data to be displayed by the logger... or not (hence the issue, and the obligation to use conditional logging, hence the actual issue of artificially increased 'cyclomatic complexity'...).
I now get the 'variadic function' point advanced by some of you (thank you John).
Note: a quick test in Java 6 shows that a varargs function does evaluate its arguments before being called, so the technique cannot be applied to the function call itself; it has to be applied to a 'log retriever object' (or 'function wrapper') whose toString() will only be called if needed. Got it.
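To illustrate the difference (a tiny sketch of my own, using java.util.logging; 'expensiveReport()' stands in for the costly data collection):
import java.util.logging.Level;
import java.util.logging.Logger;

class EvaluationDemo {
    private static final Logger logger = Logger.getLogger("demo");

    void demo() {
        // Eager: expensiveReport() runs before fine() is even entered,
        // whether or not FINE is enabled.
        logger.fine("state: " + expensiveReport());

        // Deferred: the wrapper's toString() (and so expensiveReport()) only
        // runs if the record is actually published and formatted.
        logger.log(Level.FINE, "state: {0}", new Object() {
            @Override
            public String toString() {
                return expensiveReport();
            }
        });
    }

    private String expensiveReport() {
        return "...collect and compute lots of log data...";
    }
}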
I have now posted my experience on this topic.
I will leave it there until next Tuesday for voting, then I will select one of your answers.
Again, thank you for all the suggestions :)

With current logging frameworks, the question is moot
Current logging frameworks like slf4j or log4j 2 don't require guard statements in most cases. They use a parameterized log statement so that an event can be logged unconditionally, but message formatting only occurs if the event is enabled. Message construction is performed as needed by the logger, rather than pre-emptively by the application.
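For example, with SLF4J the call site looks like this (a small sketch; the class name is illustrative, and the Widget/Dongle types are borrowed from the example further down):
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ConnectionService {
    private static final Logger log = LoggerFactory.getLogger(ConnectionService.class);

    void connect(Widget w, Dongle d) {
        // No guard needed: the {} placeholders are only substituted, and the
        // arguments' toString() only invoked, if DEBUG is actually enabled.
        log.debug("Attempting connection of dongle {} to widget {}", d, w);
    }
}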
If you have to use an antique logging library, you can read on to get more background and a way to retrofit the old library with parameterized messages.
Are guard statements really adding complexity?
Consider excluding logging guard statements from the cyclomatic complexity calculation.
It could be argued that, due to their predictable form, conditional logging checks really don't contribute to the complexity of the code.
Inflexible metrics can make an otherwise good programmer turn bad. Be careful!
Assuming that your tools for calculating complexity can't be tailored to that degree, the following approach may offer a work-around.
The need for conditional logging
I assume that your guard statements were introduced because you had code like this:
private static final Logger log = Logger.getLogger(MyClass.class);

Connection connect(Widget w, Dongle d, Dongle alt)
    throws ConnectionException
{
    log.debug("Attempting connection of dongle " + d + " to widget " + w);
    Connection c;
    try {
        c = w.connect(d);
    } catch (ConnectionException ex) {
        log.warn("Connection failed; attempting alternate dongle " + d, ex);
        c = w.connect(alt);
    }
    log.debug("Connection succeeded: " + c);
    return c;
}
In Java, each of the log statements creates a new StringBuilder, and invokes the toString() method on each object concatenated to the string. These toString() methods, in turn, are likely to create StringBuilder instances of their own, and invoke the toString() methods of their members, and so on, across a potentially large object graph. (Before Java 5, it was even more expensive, since StringBuffer was used, and all of its operations are synchronized.)
This can be relatively costly, especially if the log statement is in some heavily-executed code path. And, written as above, that expensive message formatting occurs even if the logger is bound to discard the result because the log level is too high.
This leads to the introduction of guard statements of the form:
if (log.isDebugEnabled())
log.debug("Attempting connection of dongle " + d + " to widget " + w);
With this guard, the evaluation of arguments d and w and the string concatenation is performed only when necessary.
A solution for simple, efficient logging
However, if the logger (or a wrapper that you write around your chosen logging package) takes a formatter and arguments for the formatter, the message construction can be delayed until it is certain that it will be used, while eliminating the guard statements and their cyclomatic complexity.
public final class FormatLogger
{
    private final Logger log;

    public FormatLogger(Logger log)
    {
        this.log = log;
    }

    public void debug(String formatter, Object... args)
    {
        log(Level.DEBUG, formatter, args);
    }

    // … &c. for info, warn; also add overloads to log an exception …

    public void log(Level level, String formatter, Object... args)
    {
        if (log.isEnabled(level)) {
            /*
             * Only now is the message constructed, and each "arg"
             * evaluated by having its toString() method invoked.
             */
            log.log(level, String.format(formatter, args));
        }
    }
}
class MyClass
{
    private static final FormatLogger log =
        new FormatLogger(Logger.getLogger(MyClass.class));

    Connection connect(Widget w, Dongle d, Dongle alt)
        throws ConnectionException
    {
        log.debug("Attempting connection of dongle %s to widget %s.", d, w);
        Connection c;
        try {
            c = w.connect(d);
        } catch (ConnectionException ex) {
            log.warn("Connection failed; attempting alternate dongle %s.", d);
            c = w.connect(alt);
        }
        log.debug("Connection succeeded: %s", c);
        return c;
    }
}
Now, none of the cascading toString() calls with their buffer allocations will occur unless they are necessary! This effectively eliminates the performance hit that led to the guard statements. One small penalty, in Java, would be auto-boxing of any primitive type arguments you pass to the logger.
The code doing the logging is arguably even cleaner than before, since untidy string concatenation is gone. It can be even cleaner if the format strings are externalized (using a ResourceBundle), which could also assist in maintenance or localization of the software.
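A rough sketch of that externalization (the bundle name and key are made up):
import java.util.ResourceBundle;

class MyClass
{
    // log-messages.properties would contain, e.g.:
    //   connect.attempt=Attempting connection of dongle %s to widget %s.
    private static final ResourceBundle MESSAGES =
        ResourceBundle.getBundle("log-messages");

    private static final FormatLogger log =
        new FormatLogger(Logger.getLogger(MyClass.class));

    void logAttempt(Dongle d, Widget w)
    {
        // The format string is looked up by key rather than hard-coded,
        // which helps with maintenance and localization.
        log.debug(MESSAGES.getString("connect.attempt"), d, w);
    }
}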
Further enhancements
Also note that, in Java, a MessageFormat object could be used in place of a "format" String, which gives you additional capabilities such as a choice format to handle cardinal numbers more neatly. Another alternative would be to implement your own formatting capability that invokes some interface that you define for "evaluation", rather than the basic toString() method.
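For instance, a choice format handles cardinality like this (a small, self-contained sketch):
import java.text.MessageFormat;

public class ChoiceFormatDemo
{
    public static void main(String[] args)
    {
        // The choice format picks its wording based on the numeric argument.
        MessageFormat fmt = new MessageFormat(
            "Found {0,choice,0#no connections|1#one connection|1<{0,number,integer} connections}.");

        System.out.println(fmt.format(new Object[] { 0 }));  // Found no connections.
        System.out.println(fmt.format(new Object[] { 1 }));  // Found one connection.
        System.out.println(fmt.format(new Object[] { 7 }));  // Found 7 connections.
    }
}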

In Python you pass the formatted values as parameters to the logging function. String formatting is only applied if logging is enabled. There's still the overhead of a function call, but that's minuscule compared to formatting.
log.info("a = %s, b = %s", a, b)
You can do something like this for any language with variadic arguments (C/C++, C#/Java, etc).
This isn't really intended for when the arguments are difficult to retrieve, but for when formatting them to strings is expensive. For example, if your code already has a list of numbers in it, you might want to log that list for debugging. Executing mylist.toString() will take a while to no benefit, as the result will be thrown away. So you pass mylist as a parameter to the logging function, and let it handle string formatting. That way, formatting will only be performed if needed.
Since the OP's question specifically mentions Java, here's how the above can be used:
I must insist that the problem is not 'formatting' related, but 'argument evaluation' related (evaluation that can be very costly to do, just before calling a method which will do nothing)
The trick is to have objects that will not perform expensive computations until absolutely needed. This is easy in languages like Smalltalk or Python that support lambdas and closures, but is still doable in Java with a bit of imagination.
Say you have a function getEverything(). It will retrieve every object from your database into a list. You don't want to call this if the result will be discarded, obviously. So instead of using a call to that function directly, you define an inner class called LazyGetEverything:
public class MainClass {
    private class LazyGetEverything {
        @Override
        public String toString() {
            return getEverything().toString();
        }
    }

    private Object getEverything() {
        /* returns what you want to .toString() in the inner class */
    }

    public void logEverything() {
        log.info(new LazyGetEverything());
    }
}
In this code, the call to getEverything() is wrapped so that it won't actually be executed until it's needed. The logging function will execute toString() on its parameters only if debugging is enabled. That way, your code will suffer only the overhead of a function call instead of the full getEverything() call.

In languages supporting lambda expressions or code blocks as parameters, one solution for this would be to give just that to the logging method. That one could evaluate the configuration and only if needed actually call/execute the provided lambda/code block.
I have not tried it yet, though.
Theoretically this is possible. I would not like to use it in production due to the performance issues I expect from such heavy use of lambdas/code blocks for logging.
But as always: if in doubt, test it and measure the impact on cpu load and memory.
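For what it's worth, newer Java versions added exactly this: since Java 8, java.util.logging accepts a Supplier, so the message is only built if the level is enabled. A minimal sketch (the Widget/Dongle types are just placeholders from the question):
import java.util.logging.Level;
import java.util.logging.Logger;

public class LambdaLoggingDemo {
    private static final Logger log = Logger.getLogger(LambdaLoggingDemo.class.getName());

    void connect(Widget w, Dongle d) {
        // The lambda is only invoked (and the string only concatenated)
        // if FINE is enabled for this logger.
        log.log(Level.FINE, () -> "Attempting connection of dongle " + d + " to widget " + w);
    }
}
When the level is disabled, the remaining cost is just the allocation of the small lambda object, which is the overhead worried about above; measuring is still the right call, but in practice it is usually negligible.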

Thank you for all your answers! You guys rock :)
Now my feedback is not as straight-forward as yours:
Yes, for one project (as in 'one program deployed and running on its own on a single production platform'), I suppose you can go all technical on me:
dedicated 'log retriever' objects, which can be passed to a logger wrapper that only calls toString() when necessary,
used in conjunction with a variadic logging function (or a plain Object[] array!),
and there you have it, as explained by @John Millikin and @erickson.
However, this issue forced us to think a little about 'Why exactly were we logging in the first place?'
Our project is actually 30 different projects (5 to 10 people each) deployed on various production platforms, with asynchronous communication needs and central bus architecture.
The simple logging described in the question was fine for each project at the beginning (5 years ago), but since then, we have had to step up. Enter the KPI.
Instead of asking a logger to log anything, we ask an automatically created object (called a KPI) to register an event. It is a simple call (myKPI.I_am_signaling_myself_to_you()), and it does not need to be conditional (which solves the 'artificial increase of cyclomatic complexity' issue).
That KPI object knows who calls it and, since it runs from the beginning of the application, it is able to retrieve lots of data we were previously computing on the spot when we were logging.
Plus, that KPI object can be monitored independently and can compute/publish its information on demand on a single, separate publication bus.
That way, each client can ask for the information it actually wants (like 'has my process begun, and if yes, since when?') instead of looking for the correct log file and grepping for a cryptic string...
Indeed, the question 'Why exactly were we logging in the first place?' made us realize we were not logging just for the programmer and his unit or integration tests, but for a much broader community, including some of the final clients themselves. Our 'reporting' mechanism had to be centralized, asynchronous, 24/7.
The specifics of that KPI mechanism are way out of the scope of this question. Suffice it to say that its proper calibration is by far, hands down, the single most complicated non-functional issue we are facing. It still brings the system to its knees from time to time! Properly calibrated, however, it is a life-saver.
Again, thank you for all the suggestions. We will consider them for some parts of our system when simple logging is still in place.
But the other point of this question was to illustrate to you a specific problem in a much larger and more complicated context.
Hope you liked it. I might ask a question on KPIs (which, believe it or not, are not covered in any question on SO so far!) later next week.
I will leave this answer up for voting until next Tuesday, then I will select an answer (not this one obviously ;) )

Maybe this is too simple, but what about using the "extract method" refactoring around the guard clause? Your example code of this:
public void Example()
{
    if (myLogger.isLoggable(Level.INFO))
        myLogger.info("A String");
    if (myLogger.isLoggable(Level.FINE))
        myLogger.fine("A more complicated String");
    // +1 for each test and log message
}
Becomes this:
public void Example()
{
    _LogInfo();
    _LogFine();
    // +0 for each test and log message
}

private void _LogInfo()
{
    if (!myLogger.isLoggable(Level.INFO))
        return;

    // Do your complex argument calculations/evaluations only when needed.
}

private void _LogFine() { /* Ditto ... */ }

In C or C++ I'd use the preprocessor instead of the if statements for the conditional logging.

Pass the log level to the logger and let it decide whether or not to write the log statement:
//if (myLogger.isLoggable(Level.INFO)) { myLogger.info("A String"); }
myLogger.info(Level.INFO, "A String");
UPDATE: Ah, I see that you want to conditionally create the log string without a conditional statement. Presumably at runtime rather than compile time.
I'll just say that the way we've solved this is to put the formatting code in the logger class so that the formatting only takes place if the level passes. Very similar to a built-in sprintf. For example:
myLogger.info(Level.INFO, "A String %d", some_number);
That should meet your criteria.

Conditional logging is evil. It adds unnecessary clutter to your code.
You should always send in the objects you have to the logger:
Logger logger = ...
logger.log(Level.FINE, "The foo is {0} and the bar is {1}", new Object[]{foo, bar});
and then have a java.util.logging.Formatter that uses MessageFormat to flatten foo and bar into the string to be output. It will only be called if the logger and handler will log at that level.
For added pleasure you could have some kind of expression language to be able to get fine control over how to format the logged objects (toString may not always be useful).
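As a sketch of the Formatter side (a simplified, hypothetical formatter; a real one would also print the level, timestamp and any Throwable, and it has to be installed on the Handler with setFormatter()):
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

public class MessageOnlyFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        // formatMessage() applies java.text.MessageFormat substitution of
        // {0}, {1}, ... against record.getParameters(), so foo and bar are
        // only flattened to strings here, after the level checks have passed.
        return formatMessage(record) + "\n";
    }
}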

Scala has an annotation @elidable() that allows you to remove methods with a compiler flag.
With the Scala REPL:
C:\>scala
Welcome to Scala version 2.8.0.final (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_16).
Type in expressions to have them evaluated.
Type :help for more information.
scala> import scala.annotation.elidable
import scala.annotation.elidable
scala> import scala.annotation.elidable._
import scala.annotation.elidable._
scala> @elidable(FINE) def logDebug(arg: String) = println(arg)
logDebug: (arg: String)Unit
scala> logDebug("testing")
scala>
With -Xelide-below set:
C:\>scala -Xelide-below 0
Welcome to Scala version 2.8.0.final (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_16).
Type in expressions to have them evaluated.
Type :help for more information.
scala> import scala.annotation.elidable
import scala.annotation.elidable
scala> import scala.annotation.elidable._
import scala.annotation.elidable._
scala> @elidable(FINE) def logDebug(arg: String) = println(arg)
logDebug: (arg: String)Unit
scala> logDebug("testing")
testing
scala>
See also Scala assert definition

As much as I hate macros in C/C++, at work we have #defines for the 'if' part which, if false, ignores (does not evaluate) the following expressions, but if true returns a stream into which stuff can be piped using the '<<' operator.
Like this:
LOGGER(LEVEL_INFO) << "A String";
I assume this would eliminate the extra 'complexity' that your tool sees, and also eliminates any calculating of the string, or any expressions to be logged if the level was not reached.

Here is an elegant solution using a ternary expression:
logger.info(logger.isInfoEnabled() ? "Log Statement goes here..." : null);

Consider a logging util function ...
void debugUtil(String s, Object... args) {
    if (LOG.isDebugEnabled()) {
        LOG.debug(s, args);
    }
}
Then make the call with a "closure" around the expensive evaluation that you want to avoid.
debugUtil("We got a %s", new Object() {
    @Override
    public String toString() {
        // only evaluated if the debug statement is executed
        return expensiveCallToGetSomeValue().toString();
    }
});

Related

What is the purpose of the org.apache.hadoop.mapreduce.Mapper.run() function in Hadoop?

What is the purpose of the org.apache.hadoop.mapreduce.Mapper.run() function in Hadoop? The setup() is called before calling the map(), and the cleanup() is called after the map(). The documentation for run() says
Expert users can override this method for more complete control over the execution of the Mapper.
I am looking for the practical purpose of this function.
The default run() method simply takes each key / value pair supplied by the context and calls the map() method:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
If you wanted to do more than that... you'd need to override it. For example, the MultithreadedMapper class does exactly that.
I just came up with a fairly odd case for using this.
Occasionally I've found that I want a mapper that consumes all its input before producing any output. I've done it in the past by performing the record writes in my cleanup function. My map function doesn't actually output any records; it just reads the input and stores whatever will be needed in private structures.
It turns out that this approach works fine unless you're producing a LOT of output. The best I can make out is that the mapper's spill facility doesn't operate during cleanup. So the records that are produced just keep accumulating in memory, and if there are too many of them you risk heap exhaustion. This is my speculation about what's going on - it could be wrong. But the problem definitely goes away with my new approach.
That new approach is to override run() instead of cleanup(). My only change to the default run() is that after the last record has been delivered to map(), I call map() once more with null key and value. That's a signal to my map() function to go ahead and produce its output. In this case, with the spill facility still operable, memory usage stays in check.
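A sketch of that arrangement (the class name and type parameters are made up; the point is the extra map(null, null, context) call at the end of run()):
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BufferingMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        // Signal "end of input" while the normal spill machinery is still active.
        map(null, null, context);
        cleanup(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (key == null && value == null) {
            // Flush whatever was accumulated, via context.write(...), here.
            return;
        }
        // Otherwise just accumulate the input in private structures.
    }
}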
Maybe it could be used for debugging purposes as well. You can then skip part of the input key-value pairs (=take a sample) to test your code.

Function Parameter best practice

I have a question regarding the use of function parameters.
In the past I have always written my code such that all information needed by a function is passed in as a parameter, i.e. global parameters are not used.
However, looking over other people's code, functions without parameters seem to be the norm. I should note that these are private functions of a class, and that the values that would have been passed in as parameters are in fact private member variables of that class.
This leads to neater-looking code, and I'm starting to lean towards this for private functions, but I would like other people's views.
E.g.
Start();
Process();
Stop();
is neater and more readable than:
ParamD = Start(paramA, ParamB, ParamC);
Process(ParamA, ParamD);
Stop(ParamC);
It does break encapsulation from a method point of view but not from a class point of view.
There's nothing wrong in principle with having functions access object fields, but the particular example you give scares me, because the price of simplifying your function calls is that you're obfuscating the life cycle of your data.
To translate your args example into fields, you'd have something like:
void Start() {
    // read FieldA, FieldB, and FieldC
    // set the value of FieldD
}

void Process() {
    // read FieldA and do something
    // read FieldD and do something
}

void Stop() {
    // read the value of FieldC
}
Start() sets FieldD by side effect. This means that it's probably not valid to call Process() until after you've called Start(). But the code doesn't tell you that. You only find out by searching to see where FieldD is initialized. This is asking for bugs.
My rule of thumb is that functions should only access an object field if it's always safe to access that field. Best if it's a field that's initialized at construction time, but a field that stores a reference to a collaborator object or something, which could change over time, is okay too.
But if it's not valid to call one function except after another function has produced some output, that output should be passed in, not stored in the state. If you treat each function as independent, and avoid side effects, your code will be more maintainable and easier to understand.
As you mentioned, there's a trade-off between them. There's no hard rule for always preferring one to the other. Minimizing the scope of variables keeps their side effects local, the code more modular and reusable, and debugging easier. However, it can be overkill in some cases. If you keep your classes small (which you should do) then the shared variable will generally make sense. However, there can be other issues, such as thread safety, that might affect your choice.
Not passing the object's own member attributes as parameters to its methods is the normal practice: effectively when you call myobject.someMethod() you are implicitly passing the whole object (with all its attributes) as a parameter to the method code.
I generally agree with both Mehrdad's and Mufasa's comments. There's no hard and fast rule for what is best. You should use the approach that suits the specific scenarios you work on, bearing in mind:
- readability of code
- cleanliness of code (it can get messy if you pass a million and one parameters into a method, especially if they are class-level variables; an alternative is to encapsulate parameters into groups, e.g. create a struct to hold multiple values in one object)
- testability of code. This is important in my opinion. I have occasionally refactored code to add parameters to a method purely for the purpose of improving testability, as it can allow for better unit testing.
This is something you need to measure on a case by case basis.
For example, ask yourself: if you were to use a parameter in a private method, would it ever be reasonable to pass a value that is anything other than that of a specific property of the object? If not, then you may as well access the property/field directly in the method.
On the other hand, you may ask yourself: does this method mutate the state of the object? If not, then perhaps it would be better as a static method with all its required values passed as parameters.
There are all sorts of considerations; the uppermost has to be "What is most understandable to other developers?".
In an object-oriented language it is common to pass in dependencies (classes that this class will communicate with) and configuration values in the constructor and only the values to actually be operated on in the function call.
This can actually be more readable. Consider code where you have a service that generates and publishes an invoice. There can be a variety of ways to do the publication - via a web-service that sends it to some sort of centralized server, or via an email sent to someone in the warehouse, or maybe just by sending it to the default printer. However, it is usually simpler for the method calling Publish() to not know the specifics of how the publication is happening - it just needs to know that the publication went off without a hitch. This allows you to think about fewer things at a time and concentrate on the problem better. Then you are simply making use of an interface to a service (in C#):
// Notice the consuming class needs only know what it does, not how it does it
public interface IInvoicePublisher {
    void Publish(Invoice anInvoice);
}
This could be implemented in a variety of ways, for example:
public class DefaultPrinterInvoicePublisher : IInvoicePublisher {
    DefaultPrinterFacade _printer;

    public DefaultPrinterInvoicePublisher(DefaultPrinterFacade printer) {
        _printer = printer;
    }

    public void Publish(Invoice anInvoice) {
        var printableObject = ...; // Generate a Crystal Report, or something else that can be printed
        _printer.Print(printableObject);
    }
}
The code that uses it would then take an IInvoicePublisher as a constructor parameter too so that functionality is available to be used throughout.
Generally, it's better to use parameters. Greatly increases the ability to use patterns like dependency injection and test-driven design.
If it is an internal only method though, that's not as important.
I don't pass the object's state to the private methods because the method can access the state just like that.
I pass parameters to a private method when the private method is invoked from a public method and the public method gets a parameter which it then sends to the private method.
public void DoTask(string jobid, object t)
{
    DoTask1(jobid, t);
    DoTask2(jobid, t);
}

private void DoTask1(string jobid, object t)
{
}

private void DoTask2(string jobid, object t)
{
}

When to check function/method parameters?

When writing small functions, I often have the case that some parameters are given to a function which itself only passes them on to various small functions to serve its purpose.
For example (C#-ish Syntax):
public void FunctionA(object param)
{
    DoA(param);
    DoB(param);
    DoC(param);
    // etc.
}

private void DoA(object param)
{
    DoD(param);
}

private void DoD(object param)
{
    // Error if param == null
    param.DoX();
}
So the parameters are not used inside the called function but "somewhere" in the depths of the small functions that do the job.
So when is it best to check whether my param object is null?
When checking in FunctionA:
Pro:
- There is no overhead from calling further methods which will, in the end, do nothing because the object is null.
Con:
- My syntactically wonderful FunctionA is dirtied by ugly validation code.
When checking only where the param object is actually used:
Pro:
- My syntactically wonderful FunctionA remains a joy to read :)
Cons:
- There will be overhead from calling methods which will do nothing because the param object is null.
- Further cons I can't think of at the moment.
Always put it as far down the call stack as possible, so that if you later refactor the code and something else calls DoD other than DoA you have the check in place and don't have to rework your parameter checks. The overhead of a small null check and possibly a few extra method calls is going to be trivial in most cases and doing the check an extra few times is not something you should be worrying about.
Unless you think the value is likely to be null the vast majority of the time, I'd put the validation in DoD(). If you put it in FunctionA() you'll have to repeat the validation code later when you decide FunctionB() also needs to use DoD(). To me, the extra overhead is worth not having to repeat myself.
As a guideline, I make a habit of checking every parameter that is used by the method, including my own private variables. I would therefore only check for null in your DoD method.
You might want to check out Bertrand Meyer's Design by Contract mantra.
Fail early. Unless a partial result is preferable to no result at all, execution should stop as soon as the code can detect that there is a problem. Why should the code run through several methods when the result downstream is going to be an invalid or missing argument?
If it is possible that the downstream methods could be called separately, then validation could be handled by a call to a common validation method, as has already been suggested.
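For instance, the shared check could be as small as this (a sketch in Java; the helper name is invented):
final class Validation {
    private Validation() { }

    // One place that states the precondition; every entry point that needs
    // the check calls this instead of repeating the null test.
    static <T> T requireNonNull(T value, String name) {
        if (value == null) {
            throw new IllegalArgumentException(name + " must not be null");
        }
        return value;
    }
}

// Usage at whichever layer is the agreed validation boundary:
//     DoD(Validation.requireNonNull(param, "param"));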
Always check everything :) Coming from the deep bowels of coding libraries for embedded systems, this is the method I'd use:
public void FunctionA(object param)
{
    assert(param != null && param.canDoX());
    DoA(param);
    DoB(param);
    DoC(param);
    // etc.
}

private void DoA(object param)
{
    assert(param != null && param.canDoX());
    DoD(param);
}

private void DoD(object param)
{
    assert(param != null && param.canDoX());
    if (param != null)
        param.DoX();
    else
        // Signal error, for instance by throwing a runtime exception.
        // This error-handling is then verified by a unit test that
        // uses a release build of the code.
}
To de-clutter this, the obvious solution is to break the validation out into a separate validator function. Using a C-style preprocessor, or just sticking to asserts, it should be trivial to have this "paranoid" validation excluded from release builds.
It's the caller's responsibility to pass a valid parameter. In this case:
if (param != null)
{
    FunctionA(param);
}

How do you implement Pipes and Filters pattern with LinqToSQL/Entity Framework/NHibernate?

While building my DAL repository, I stumbled upon a concept called Pipes and Filters. I read about it here, here and saw a screencast from here. I am still not sure how to go about implementing this pattern. Theoretically it all sounds good, but how do we really implement this in an enterprise scenario?
I would appreciate any resources, tips, examples or explanations of this pattern in the context of the data mappers/ORMs mentioned in the question.
Thanks in advance!!
Ultimately, LINQ on IEnumerable<T> is a pipes-and-filters implementation. IEnumerable<T> is a streaming API - meaning that data is lazily returned as you ask for it (via iterator blocks), rather than everything being loaded at once and returned as a big buffer of records.
This means that your query:
var qry = from row in source // IEnumerable<T>
          where row.Foo == "abc"
          select new { row.ID, row.Name };
is:
var qry = source.Where(row => row.Foo == "abc")
                .Select(row => new { row.ID, row.Name });
As you enumerate over this, it will consume the data lazily. You can see this graphically with Jon Skeet's Visual LINQ. The only things that break the pipe are things that force buffering: OrderBy, GroupBy, etc. For high-volume work, Jon and I worked on Push LINQ for doing aggregates without buffering in such scenarios.
IQueryable<T> (exposed by most ORM tools - LINQ-to-SQL, Entity Framework, LINQ-to-NHibernate) is a slightly different beast; because the database engine is going to do most of the heavy lifting, the chances are that most of the steps are already done - all that is left is to consume an IDataReader and project this to objects/values - but that is still typically a pipe (IQueryable<T> implements IEnumerable<T>) unless you call .ToArray(), .ToList() etc.
With regard to use in enterprise... my view is that it is fine to use IQueryable<T> to write composable queries inside the repository, but they shouldn't leave the repository - as that would make the internal operation of the repository subject to the caller, so you would be unable to properly unit test / profile / optimize / etc. I've taken to doing clever things in the repository, but return lists/arrays. This also means my repository stays unaware of the implementation.
This is a shame - as the temptation to "return" IQueryable<T> from a repository method is quite large; for example, this would allow the caller to add paging/filters/etc - but remember that they haven't actually consumed the data yet. This makes resource management a pain. Also, in MVC etc you'd need to ensure that the controller calls .ToList() or similar, so that it isn't the view that is controlling data access (otherwise, again, you can't unit test the controller properly).
A safe (IMO) use of filters in the DAL would be things like:
public Customer[] List(string name, string countryCode) {
    using (var ctx = new CustomerDataContext()) {
        IQueryable<Customer> qry = ctx.Customers.Where(x => x.IsOpen);
        if (!string.IsNullOrEmpty(name)) {
            qry = qry.Where(cust => cust.Name.Contains(name));
        }
        if (!string.IsNullOrEmpty(countryCode)) {
            qry = qry.Where(cust => cust.CountryCode == countryCode);
        }
        return qry.ToArray();
    }
}
Here we've added filters on-the-fly, but nothing happens until we call ToArray. At this point, the data is obtained and returned (disposing the data-context in the process). This can be fully unit tested. If we did something similar but just returned IQueryable<T>, the caller might do something like:
var custs = customerRepository.GetCustomers()
.Where(x=>SomeUnmappedFunction(x));
And all of a sudden our DAL starts failing (cannot translate SomeUnmappedFunction to TSQL, etc). You can still do a lot of interesting things in the repository, though.
The only pain point here is that it might push you to have a few overloads to support different calling patterns (with/without paging, etc). Until optional/named parameters arrive, I find the best answer here is to use extension methods on the interface; that way, I only need one concrete repository implementation:
class CustomerRepository : ICustomerRepository {
    public Customer[] List(
        string name, string countryCode,
        int? pageSize, int? pageNumber) { ... }
}

interface ICustomerRepository {
    Customer[] List(
        string name, string countryCode,
        int? pageSize, int? pageNumber);
}

static class CustomerRepositoryExtensions {
    public static Customer[] List(
        this ICustomerRepository repo,
        string name, string countryCode) {
        return repo.List(name, countryCode, null, null);
    }
}
Now we have virtual overloads (as extension methods) on ICustomerRepository - so our caller can use repo.List("abc","def") without having to specify the paging.
Finally - without LINQ, using pipes and filters becomes a lot more painful. You'll be writing some kind of text based query (TSQL, ESQL, HQL). You can obviously append strings, but it isn't very "pipe/filter"-ish. The "Criteria API" is a bit better - but not as elegant as LINQ.

Which design is better for a class that simply runs a self-contained computation?

I'm currently working on a class that calculates the difference between two objects. I'm trying to decide what the best design for this class would be. I see two options:
1) Single-use class instance. Takes the objects to diff in the constructor and calculates the diff for that.
public class MyObjDiffer {
    public MyObjDiffer(MyObj o1, MyObj o2) {
        // Calculate diff here and store results in member variables
    }

    public boolean areObjectsDifferent() {
        // ...
    }

    public Vector getOnlyInObj1() {
        // ...
    }

    public Vector getOnlyInObj2() {
        // ...
    }

    // ...
}
2) Re-usable class instance. Constructor takes no arguments. Has a "calculateDiff()" method that takes the objects to diff, and returns the results.
public class MyObjDiffer {
    public MyObjDiffer() { }

    public DiffResults getResults(MyObj o1, MyObj o2) {
        // calculate and return the results. Nothing is stored in this class's members.
    }
}

public class DiffResults {
    public boolean areObjectsDifferent() {
        // ...
    }

    public Vector getOnlyInObj1() {
        // ...
    }

    public Vector getOnlyInObj2() {
        // ...
    }
}
The diffing will be fairly complex (details don't matter for the question), so there will need to be a number of helper functions. If I take solution 1 then I can store the data in member variables and don't have to pass everything around. It's slightly more compact, as everything is handled within a single class.
However, conceptually, it seems weird that a "Differ" would be specific to a certain set of results. Option 2 splits the results from the logic that actually calculates them.
EDIT: Option 2 also provides the ability to make the "MyObjDiffer" class static. Thanks kitsune, I forgot to mention that.
I'm having trouble seeing any significant pro or con to either option. I figure this kind of thing (a class that just handles some one-shot calculation) has to come up fairly often, and maybe I'm missing something. So, I figured I'd pose the question to the cloud. Are there significant pros or cons to one or the other option here? Is one inherently better? Does it matter?
I am doing this in Java, so there might be some restrictions on the possibilities, but the overall question of design is probably language-agnostic.
Use Object-Oriented Programming
Use option 2, but do not make it static.
The Strategy Pattern
This way, an instance of MyObjDiffer can be passed to anyone that needs a Strategy for computing the difference between objects.
If, down the road, you find that different rules are used for computation in different contexts, you can create a new strategy to suit. With your code as it stands, you'd extend MyObjDiffer and override its methods, which is certainly workable. A better approach would be to define an interface, and have MyObjDiffer as one implementation.
Any decent refactoring tool will be able to "extract an interface" from MyObjDiffer and replace references to that type with the interface type at some later time if you want to delay the decision. Using "Option 2" with instance methods, rather than class procedures, gives you that flexibility.
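A minimal sketch of that extraction (the interface name and the consuming class are invented for illustration):
// The extracted Strategy interface:
public interface ObjDiffer {
    DiffResults getResults(MyObj o1, MyObj o2);
}

// The existing class just declares that it implements it:
//     public class MyObjDiffer implements ObjDiffer { ... }

// Callers depend only on the interface, so a different comparison strategy
// can be injected later without touching them:
public class ChangeReporter {
    private final ObjDiffer differ;

    public ChangeReporter(ObjDiffer differ) {
        this.differ = differ;
    }

    public boolean hasChanges(MyObj before, MyObj after) {
        return differ.getResults(before, after).areObjectsDifferent();
    }
}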
Configure an Instance
Even if you never need to write a new comparison method, you might find that specifying options to tailor the behavior of your basic method is useful. If you think about using the "diff" command to compare text files, you'll remember how many different options there are: whitespace- and case-sensitivity, output options, etc. The best analog to this in OO programming is to consider each diff process as an object, with options set as properties on that object.
You want solution #2 for a number of reasons. And you don't want it to be static.
While static seems like fun, it's a maintenance nightmare when you come up with either (a) a new algorithm with the same requirements, or (b) new requirements.
A first-class object (without much internal state) allows you to evolve into a class hierarchy of different differs -- some slower, some faster, some with more memory, some with less memory, some for old requirements, some for new requirements.
Some of your differs may wind up with complicated internal state or memory, or incremental diffing or hash-code-based diffing. All kinds of possibilities might exist.
A reusable object allows you to pick your differ at application start-up time using a properties file.
In the long run, you want to minimize the number of new operations that are scattered throughout your application. You'd like to have your new operations focused in places where you can find and control them. To change from old differ algorithm to new differ algorithm, you'd like to do the following.
Write the new subclass.
Update a properties file to start using the new subclass.
And be completely confident that there wasn't some hidden d= new MyObjDiffer( x, y ) tucked away that you didn't know about.
You want to use d= theDiffer.getResults( x, y ) everywhere.
What the Java libraries do is have a DifferFactory that's static. The factory checks the properties and emits the correct Differ.
DifferFactory df = new DifferFactory();
MyObjDiffer mod = df.getDiffer();
mod.getResults(x, y);
The Factory typically caches the single copy -- it doesn't have to physically read the properties every time getDiffer is called.
This design gives you the ultimate flexibility in the future. And it looks like other parts of the Java libraries.
I can't really say I have firm reasons why it's the 'best' approach, but I usually write classes for objects that you can have a 'conversation' with. So it would be like your 'single use' option 1, except that by calling a setter, you would 'reset' it for another use.
Rather than supplying the implementation (which is pretty obvious), here's a sample invocation:
MyComparer cmp = new MyComparer(obj1, obj2);
boolean match = cmp.isMatch();
cmp.setSubjects(obj3, obj4);
List uniques1 = cmp.getOnlyIn(MyComparer.FIRST);
cmp.setSubject(MyComparer.SECOND, obj5);
List uniques = cmp.getOnlyIn(MyComparer.SECOND);
... and so on.
This way, the caller gets to decide whether they want to instantiate lots of objects, or keep reusing the one.
It's particularly useful if the object needs a lot of setup. Let's say your comparer takes options. There could be many. Set it up once, then use it many times.
// set up cmp with options and the master object
MyComparer cmp = new MyComparer();
cmp.setIgnoreCase(true);
cmp.setIgnoreTrailingWhitespace(false);
cmp.setSubject(MyComparer.FIRST, canonicalSubject);

// find items that are in the testSubjects objects,
// but not in the master.
List extraItems = new ArrayList();
for (Iterator it = testSubjects.iterator(); it.hasNext(); ) {
    cmp.setSubject(MyComparer.SECOND, it.next());
    extraItems.addAll(cmp.getOnlyIn(MyComparer.SECOND));
}
Edit: BTW I called it MyComparer rather than MyDiffer because it seemed more natural to have an isMatch() method than an isDifferent() method.
I'd take numero 2 and reflect on whether I should make this static.
Why are you writing a class whose only purpose is to calculate the difference between two objects? That sounds like a task for either a static function or a member function of the class.
I would go for a static constructor method, something like.
Diffs diffs = Diffs.calculateDifferences(foo, bar);
In this way, it's clear when you're calculating the differences, and there is no way to misuse the object's interface.
I like the idea of explicitly starting the work rather than having it occur on instantiation. Also, I think the results are substantial enough to warrant their own class. Your first design isn't as clean to me. Someone using this class would have to understand that, after performing the calculation, some other class members are now holding the results. Option 2 is clearer about what is happening.
It depends on how you're going to use the diffs. In my mind, it makes sense to treat a diff as a logical entity because it needs to support some operations like 'getDiffString()', 'numHunks()', or 'apply()'. I might take your first one and do it more like this:
public class Diff
{
    public Diff(String path1, String path2)
    {
        // get diff
        if (same)
            throw new EmptyDiffException();
    }

    public String getDiffString()
    {
    }

    public int numHunks()
    {
    }

    public boolean apply(String path1)
    {
        // try to apply diff as patch to file at path1. Return
        // whether the patch applied successfully or not.
    }

    public boolean merge(Diff diff)
    {
        // similar to apply(), but do merge yourself with another diff
    }
}
Using a diff object like this also might lend itself to things like keeping a stack of patches, or serializing to a compressed archive, maybe an "undo" queue, and so on.