I have been reading Clean Code by Robert C. Martin and have a basic (but fundamental) question about functions and program structure.
The book emphasizes that functions should:
be brief (like 10 lines, or less)
do one, and only one, thing
I am a little unclear on how to apply this in practice. For example, I am developing a program to:
load a baseline text file
parse baseline text file
load a test text file
parse test text file
compare parsed test with parsed baseline
aggregate results
I have tried two approaches, but neither seems to meet Martin's criteria:
APPROACH 1
Set up a main function that centrally commands the other functions in the workflow. But then main() can end up being very long (violates #1) and is obviously doing many things (violates #2). Something like this:
main()
{
// manage steps, one at a time, from start to finish
baseFile = loadFile("baseline.txt");
parsedBaseline = parseFile(baseFile);
testFile = loadFile("test.txt");
parsedTest = parseFile(testFile);
comparisonResults = compareFiles(parsedBaseline, parsedTest);
aggregateResults(comparisonResults);
}
APPROACH 2
Use main to trigger a function "cascade". But each function calls a dependency, so it still seems like they are doing more than one thing (violates #2?). For example, calling the aggregation function internally calls the results comparison. The flow also seems backwards, as it starts with the end goal and calls dependencies as it goes. Something like this:
main()
{
// trigger end result, and let functions internally manage
aggregateResults("baseline.txt", "test.txt");
}
aggregateResults(baseFile, testFile)
{
comparisonResults = compareFiles(baseFile, testFile);
// aggregate results here
return theAggregatedResult;
}
compareFiles(baseFile, testFile)
{
parsedBase = parseFile(baseFile);
parsedTest = parseFile(testFile);
// compare parsed files here
return theFileComparison;
}
parseFile(filename)
{
theLoadedFile = loadFile(filename);
// parse the file here
return theParsedFile;
}
loadFile(filename)
{
//load the file here
return theLoadedFile;
}
Obviously functions need to call one another. So what is the right way to structure a program to meet Martin's criteria, please?
I think you are interpreting rule 2 wrong by not taking context into account. The main() function only does one thing, and that is everything, i.e. running the whole program. Let's say you have a convert_abc_file_xyz_file(source_filename, target_filename). This function should only do the one thing its name (and arguments) implies: converting a file of format abc into one of format xyz. Of course, on a lower level there are many things to be done to achieve this. For instance, reading the source file (read_abc_file(…)), converting the data from format abc into format xyz (convert_abc_to_xyz(…)), and then writing the converted data into a new file (write_xyz_file(…)).
The second approach is wrong because it becomes impossible to write functions that only do one thing: every function also does all the other things in the "cascaded" calls. In the first approach it is possible to test or reuse single functions, i.e. just call read_abc_file() to read a file. If that function calls convert_abc_to_xyz(), which in turn calls write_xyz_file(), that is not possible anymore.
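A minimal sketch of the first approach in Python, to show why it keeps each step testable on its own. It assumes the file contents are already available as strings and that "parsing" only means splitting into non-empty lines; every function name and the comparison rule are illustrative, not taken from the question.

```python
def parse_text(text):
    """Parse raw text into a list of stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def compare_parsed(baseline, test):
    """Pair up lines by position and record whether each pair matches.

    zip() stops at the shorter input, so extra lines are ignored here.
    """
    return [b == t for b, t in zip(baseline, test)]


def aggregate_results(comparison):
    """Summarize the comparison as counts of matches and mismatches."""
    matches = sum(comparison)
    return {"matches": matches, "mismatches": len(comparison) - matches}


def run(baseline_text, test_text):
    """The one thing done here is running the whole workflow."""
    parsed_baseline = parse_text(baseline_text)
    parsed_test = parse_text(test_text)
    comparison = compare_parsed(parsed_baseline, parsed_test)
    return aggregate_results(comparison)


summary = run("a\nb\nc\n", "a\nx\nc\n")
print(summary)
```

Each helper can be called and tested without triggering any of the others; only run() knows the order of the steps.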
I tried to use remotecall() in Julia to distribute work to specific processes. The function I would like to run does not return anything, but it writes output. I cannot make it work: there is no output file after running the code.
This is the test code I am creating:
using DelimitedFiles
addprocs(4) # add 4 processors
@everywhere function test(x) # Define the function
print("hi")
writedlm(string("test",string(x),".csv"), [x], ',')
end
remotecall(test, 2, 2) # To run the function on process 2
remotecall(test, 3, 3) # To run the function on process 3
This is the output I am getting:
Future(3, 1, 67, nothing)
And no output file (CSV) is created, and "hi" is not shown.
I wonder if anyone can help me with this or tell me what I did wrong. I am fairly new to Julia and have never used parallel processing.
The background is that I need to run a big simulation (a big function with a bunch of includes, but no direct return values) many times, and I would like to split the work across different processors.
Thanks a lot
If you want to use a module function in a worker, you need to import that module locally in that worker first, just like you have to do it in your 'root' process. Therefore your using DelimitedFiles directive needs to occur "@everywhere" first, rather than just on the 'root' process. In other words:
@everywhere using DelimitedFiles
Btw, I am assuming you're using a relatively recent version of Julia and simply forgot to add the using Distributed directive in your example.
Furthermore, when you perform a remote call, what you get back is a "Future" object, which is a way of allowing you to obtain the 'future results of that computation' from that worker, once they're finished. To get the results of that 'future computation', use fetch.
This is all very simplistic and general information, since you haven't provided a specific example that can be copy / pasted and answered specifically. Have a look at the relevant section in the manual, it's fairly clearly written: https://docs.julialang.org/en/v1/manual/parallel-computing/#Multi-Core-or-Distributed-Processing-1
I have files in one folder with following naming convention
ClientID_ClientName_Date_Fileextension
12345_Dell_20110103.CSV
I want to extract the ClientID from the filename and store it in a variable. I am unsure how I would do this. It seems that a Script Task would suffice, but I do not know how to proceed.
Your options are Expressions on SSIS Variables or a Script Task. As a general rule I prefer Expressions, but here I can tell that would mean a lot of expression code or a lot of intertwined variables.
Instead, I'd use the String.Split method in .NET. If you called the Split method for your sample data and provided a delimiter of the underscore _ then you'd receive a 3 element array
12345
Dell
20110103.CSV
Wrap that in a Try Catch block and always grab the first element (index 0), which is the ClientID. Quick and dirty. Because the ID comes first, this still works for names like 12345_Dell_Quest_20110103.CSV, although the client name and date would then no longer sit at fixed positions, but you didn't ask that question.
Approximate code:
string phrase = Dts.Variables["User::CurrentFile"].Value.ToString();
string[] stringSeparators = new string[] {"_"};
string[] words;
try
{
words = phrase.Split(stringSeparators, StringSplitOptions.None);
// User::ClientID is an assumed variable name; use whatever variable you created
Dts.Variables["User::ClientID"].Value = words[0];
}
catch
{
; // Do something with this error
}
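Since the question asks for the ClientID, which sits before the first underscore, the split idea can also be sketched outside SSIS. The following is a Python illustration of the same logic, not Script Task code; the function name extract_client_id and the choice to return None for names without an underscore are assumptions made for the example.

```python
def extract_client_id(filename):
    """Return the text before the first underscore, or None if there is none."""
    parts = filename.split("_")
    if len(parts) < 2 or not parts[0]:
        return None
    return parts[0]


client_id = extract_client_id("12345_Dell_20110103.CSV")
print(client_id)
```

Taking the first element keeps working when the client name itself contains an underscore, because only the leading field is read.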
First, here is the way I'm calling the function:
eval([functionName '(''stringArg'')']); % functionName = 'someStringForTheFunctionName'
Now, I have two functionName functions in my path, one that takes the stringArg and another that takes something else. I'm getting errors because the first one MATLAB finds is the function that doesn't take the stringArg. Given the way I'm calling the functionName function, how can I call the correct one?
Edit:
I tried the which function:
which -all someStringForTheFunctionName
The result :
C:\........\x\someStringForTheFunctionName
C:\........\y\someStringForTheFunctionName % Shadowed
The shadowed function is the one I want to call.
Function names must be unique in MATLAB. If there are duplicate names, MATLAB uses the first one it finds on your search path.
Having said that, there are a few options open to you.
Option 1. Use @ directories, putting each version in a separate directory. Essentially you are using the ability of MATLAB to apply a function to specific classes. So, you might set up a pair of directories:
@char
@double
Put your copies of myfun.m in the respective directories. Now when MATLAB sees a double input to myfun, it will direct the call to the double version. When MATLAB gets char input, it goes to the char version.
BE CAREFUL. Do not put these @ directories explicitly on your search path. DO put them INSIDE a directory that is on your search path.
A problem with this scheme is if you call the function with a SINGLE precision input, MATLAB will probably have a fit, so you would need separate versions for single, uint8, int8, int32, etc. You cannot just have one version for all numeric types.
Option 2. Have only one version of the function, that tests the first argument to see if it is numeric or char, then branches to perform either task as appropriate. Both pieces of code will most simply be in one file then. The simple scheme will have subfunctions or nested functions to do the work.
Option 3. Name the functions differently. Hey, it's not the end of the world.
Option 4: As Shaun points out, one can simply change the current directory. MATLAB always looks first in your current directory, so it will find the function in that directory as needed. One problem is this is time consuming. Any time you touch a directory, things slow down, because there is now disk input needed.
The worst part of changing directories is in how you use MATLAB. It is (IMHO) a poor programming style to force the user to always be in a specific directory based on what code inputs they wish to run. Better is a data driven scheme. If you will be reading in or writing out data, then be in THAT directory. Use the MATLAB search path to categorize all of your functions, as functions tend not to change much. This is a far cleaner way to work than requiring the user to migrate to specific directories based on how they will be calling a given function.
Personally, I'd tend to suggest option 2 as the best. It is clean. It has only ONE main function that you need to work with. If you want to keep the functions distinct, put them as separate nested or sub functions inside the main function body. Inside, of course, they will have distinct names, based on how they are driven.
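The shape of option 2 can be sketched as follows. This is a Python illustration of the idea rather than MATLAB code, and the two helper functions and what they compute are purely hypothetical.

```python
def _process_text(value):
    """Hypothetical handler for string input."""
    return value.upper()


def _process_number(value):
    """Hypothetical handler for numeric input."""
    return value * 2


def myfun(value):
    """Single public entry point that branches on the type of its argument."""
    if isinstance(value, str):
        return _process_text(value)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return _process_number(value)
    raise TypeError(f"unsupported input type: {type(value).__name__}")


print(myfun("abc"), myfun(3))
```

Callers only ever see one name, while the distinct implementations keep distinct names inside the same file.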
OK, so a messy answer, but it should do it. My test function was 'echo'
funcstr='echo'; % string representation of function
Fs=which('-all',funcstr);
for v=1:length(Fs)
if (strcmp(Fs{v}(end-1:end),'.m')) % Don't move built-ins, they will be shadowed anyway
movefile(Fs{v},[Fs{v} '_BK']);
end
end
for v=1:length(Fs)
if (strcmp(Fs{v}(end-1:end),'.m'))
movefile([Fs{v} '_BK'],Fs{v});
end
try
eval([funcstr '(''stringArg'')']);
break;
catch
if (strcmp(Fs{v}(end-1:end),'.m'))
movefile(Fs{v},[Fs{v} '_BK']);
end
end
end
for w=1:length(Fs) % restore every file that is still renamed
if (exist([Fs{w} '_BK'],'file'))
movefile([Fs{w} '_BK'],Fs{w});
end
end
You can also create a function handle for the shadowed function. The problem is that the first function is higher on the matlab path, but you can circumvent that by (temporarily) changing the current directory.
Although it is not nice imo to change that current directory (actually I'd rather never change it while executing code), it will solve the problem quite easily; especially if you use it in the configuration part of your function with a persistent function handle:
function outputpars = myMainExecFunction(inputpars)
% configuration
persistent shadowfun;
if isempty(shadowfun)
funpath1 = 'C:\........\x\fun';
funpath2 = 'C:\........\y\fun'; % Shadowed
curcd = cd;
cd(funpath2);
shadowfun = @fun;
cd(curcd); % and go back to the original cd
end
outputpars{1} = shadowfun(inputpars); % will use the shadowed function
outputpars{2} = fun(inputpars); % will use the function highest on the matlab path
end
This problem was also discussed here as a possible solution to this problem.
I believe it actually is the only way to overload a builtin function outside the source directory of the overloading function (eg. you want to run your own sum.m in a directory other than where your sum.m is located.)
EDIT: Old answer no longer good
The run command won't work because it's a function, not a script.
Instead, your best approach would honestly be to just figure out which of the functions needs to be run, get the current dir, change it to the one your function is in, run it, and then change back to your start dir.
This approach, while not perfect, seems MUCH easier to code, to read, and less prone to breaking. And it requires no changing of names or creating extra files or function handles.
I have a certain function that uses the same (few, 2-5 depending on how I may change it to accommodate possible future uses) lines of code 4 times.
I looked at this question, but it's not specific enough for me, and doesn't match the direction I'm going for.
Here's some pseudo:
function myFunction() {
if (something) {
// Code line 1
// Code line 2
// Code line 3
}
else if (somethingElse) {
// Code line 1
// Code line 2
// Code line 3
}
else if (anotherThing) {
// Code line 1
// Code line 2
// Code line 3
}
else if (theLastThing) {
// Code line 1
// Code line 2
// Code line 3
}
else {
// Not previously used code
}
}
Those same 3 lines of code are copy/pasted (constructing the same object if any of these conditions are met). Is it a good practice to create a function that I can pass all this information to and return the necessary information when it's finished? All of these conditional statements are inside a loop that could run up to 1000 or so times.
I'm not sure if the cost of preparing the stack frame(?) by jumping into another function is more costly over 1000 iterations to be worth having ~15 lines of duplicated code. Obviously function-alizing it would make it more readable, however this is very specific functionality that is not used anywhere else. The function I could write to eliminate the copy/paste mentality would be something like:
function myHelperFunction(someParameter, someOtherParameter) {
// Code line 1
// Code line 2
// Code line 3
return usefulInformation;
}
And then call the function in all those conditional statements as 1 line per conditional statement:
myHelperFunction(myPassedParameter, myOtherPassedParameter);
Essentially turning those 12 lines into 4.
So the question - is this a good practice in general, to create a new function for a very small amount of code to save some space and readability? Or is the cost for jumping functions too impacting to be worth it? Should one always create a new function for any code that they might copy/paste in the future?
PS - I understand that if this bit of code were to be used in different (Classes) or source files that it would be logical to turn it into a function to avoid needing to find all the locations where it was copy/pasted in order to make changes. But I'm talking more or less single-file/single-Class or in-function kind of a dilemma.
Also, feel free to fix my tags/title if I didn't do it correctly. I'm not really sure how to title/tag this post correctly.
The answer to any optimization question that isn't also an algorithms/data structures question is: Profile your code! Only optimize things that show up as problem areas.
Which means you should find out if function call overhead is actually a performance problem in the specific program you're writing. If it is, inline the code. If it isn't, don't. Simple as that.
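As a rough illustration of measuring rather than guessing, here is a small Python micro-benchmark using the standard timeit module. The helper build_record and its three lines are hypothetical stand-ins for the duplicated code, and the timings will differ by machine and language, so treat it as a sketch of the method only.

```python
import timeit


def build_record(a, b):
    """Hypothetical three-line helper standing in for the duplicated code."""
    total = a + b
    scaled = total * 2
    return scaled


def with_helper(n=1000):
    """Run the work through a function call on every iteration."""
    return [build_record(i, i + 1) for i in range(n)]


def inlined(n=1000):
    """Do the same arithmetic inline, without the extra call."""
    return [(i + (i + 1)) * 2 for i in range(n)]


# Time 200 runs of each variant; only the relative difference matters.
helper_seconds = timeit.timeit(with_helper, number=200)
inline_seconds = timeit.timeit(inlined, number=200)
print(f"helper: {helper_seconds:.4f}s  inlined: {inline_seconds:.4f}s")
```

If the difference is negligible next to the rest of the program's run time, the call overhead is not the problem area.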
You're approaching this the wrong way, in my opinion. In the first place, you shouldn't be using multiple (else)ifs that all execute the same code; use one with a compound or precomputed (in this case I recommend precomputed due to all the possible subconditions) condition. Something like this will probably make maintaining the code a lot easier.
function myFunction() {
bool condition = something ||
somethingElse ||
anotherThing ||
theLastThing;
if (condition) {
// Code line 1
// Code line 2
// Code line 3
}
else {
// Not previously used code
}
}
Yes, create a function. In general you should follow the DRY principle: Don't Repeat Yourself.
http://en.wikipedia.org/wiki/Don%27t_repeat_yourself
Your stack operations are going to be minimal for something like this. See Imre Kerr's comment on your question.
It's not just for readability. So many reasons. Maintainability is huge. If this code has to change, it will be a pain for someone else to come along and try to figure out every place to change it. It's a lot better to only have to change code in one place.
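A minimal sketch of that refactoring in Python, assuming the duplicated lines build an object from a couple of inputs. The names build_item and classify, and the conditions themselves, are invented for the example.

```python
def build_item(kind, value):
    """Hypothetical helper holding the lines that used to be copy/pasted."""
    label = f"{kind}:{value}"
    return {"kind": kind, "label": label}


def classify(value):
    """Each branch is now one call instead of a repeated block."""
    if value < 0:
        return build_item("negative", value)
    elif value == 0:
        return build_item("zero", value)
    elif value < 10:
        return build_item("small", value)
    elif value < 100:
        return build_item("medium", value)
    else:
        return None  # the path that does not build the object


items = [classify(v) for v in (-5, 0, 7, 42, 1000)]
print(items)
```

A change to how the object is built now happens in build_item alone.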
I don't know if this applies to the example you provided, but factoring out code is not the only reason to write a function; you can also think in terms of tests.
A function provides a programming unit that can be tested separately.
So it may happen that you decompose a complex operation into several simpler/more elementary units, even if those functions are only called once.
Since you asked the question for a few lines of code, you could ask yourself:
can I reasonably name this function? (justDoThis should be OK, doThisAndThatAndThenAnotherThing less so)
does it have a reasonable number of parameters? (I would say two or three)
is it worth testing it as a separate unit? (does it simplify overall testing)
is the code more readable/understandable with such a function call or not? (if the answer to the first two questions is no, it's not necessarily obvious)
This is a wonderful question, and the answer is: It depends.
Personally I would create a function for increased code readability, but if you are looking for efficiency, you may want to leave the code copied and pasted.
I've got a simple function like this:
function CreateFileLink()
{
global $Username;
return date("d-m-y-g-i");
}
How do you write code to test a function like that where the return data is variable?
You could test it if you could somehow get control over that date() function. For example, in this case, you only care that the date function is being called with "d-m-y-g-i"; you don't really care about the output of it. Maybe something like:
function CreateFileLink(DateProvider dateProvider)
{
global $Username;
return dateProvider.date("d-m-y-g-i");
}
Sorry, I don't even know what language this is but hopefully you can see my point. In production code, you'd call this by doing:
// the inner date(str) call stands for the language's built-in date function
DateProvider standardDateProvider = new DateProvider() { String date(String str) { return date(str); } };
CreateFileLink(standardDateProvider);
But in your test, you can provide an alternate implementation that will throw an error if the input isn't what you expect:
DateProvider mockProvider = new DateProvider()
{
String date(String str)
{
if (!str.equals("d-m-y-g-i")) throw new RuntimeException();
return "success";
}
};
What your unit tests need to do is set up the environment for this function to work properly. In other words, you simulate a running system by setting up the other variables; then you SHOULD be able to know what it will return based on how your unit test set up these variables (unless of course the return value is a random number, in which case all you can do, as Randolpho suggested, is make sure it doesn't throw).
If your unit tests end up setting up and calling a whole bunch of other methods just to test this function, it's probably a good indication that your function is tightly coupled and could be broken down into smaller pieces.
There's a good Google Tech Talk about dependency injection.
Basically, the ideal way to create testable code is to make all of your dependencies explicit so that you do not wind up with situations where calling the same function twice can result in two different answers.
In the case of dates and such, if you create a "system object" which you pass explicitly into your function, you could then create a "test system object" which could be used for testing and which would return fixed values rather than the current date.
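One way to picture this is the Python sketch below, where the clock is passed in as a plain callable instead of a full "system object". The name create_file_link, the now parameter and the fixed_clock test double are assumptions for illustration; the string is built to mimic PHP's "d-m-y-g-i" format (two-digit day, month and year, 12-hour hour without leading zero, zero-padded minutes).

```python
from datetime import datetime


def create_file_link(now=datetime.now):
    """Build a 'd-m-y-g-i' style string from an injectable clock."""
    moment = now()
    hour_12 = moment.hour % 12 or 12  # 12-hour clock without a leading zero
    return f"{moment:%d-%m-%y}-{hour_12}-{moment:%M}"


def fixed_clock():
    """Test double that always returns the same instant."""
    return datetime(2011, 1, 3, 14, 5)


link = create_file_link(now=fixed_clock)
print(link)  # 03-01-11-2-05
```

Production code calls create_file_link() with the default clock, while a test passes fixed_clock and can assert on the exact string.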
Well, if you use a unit test, you know which day, month and year you are testing the function.
So, call the function. You have the current time (when the test is run). Subtract the current time from the function time. The difference should be very small (like 2 seconds).
If the difference is greater than that, I would fail the test. (depending on the speed of your system of course).
It looks like the only test you can run on this function is that it does not throw an exception. There is no need to test the return data, as it's generated by an external entity (the date() function).