When is it okay to check if a file exists? - language-agnostic

File systems are volatile. This means that you can't trust the result of one operation to still be valid for the next one, even if it's the next line of code. You can't just say if (some file exists and I have permissions for it) open the file, and you can't say if (some file does not exist) create the file. There is always the possibility that the result of your if condition will change in between the two parts of your code. The operations are distinct: not atomic.
To make matters worse, the nature of the problem means that if you're tempted to make this check, odds are you're already worried or aware that something you don't control is likely to happen to the file. The nature of development environments makes this event less likely to occur during your testing and very difficult to reproduce. So not only do you have a bug, but the bug won't show up while testing.
Therefore under normal circumstances the best course of action is to not even try to check if a file or directory exists. Instead, put your development time into handling exceptions from the file system. You have to handle these exceptions anyway, so this is a much better use of your resources. Even though exceptions are slow, checking the existence of a file requires an extra trip to disk, and disk access is much slower. I even have a well-voted answer to this effect in another question.
But I'm having some doubts. In .NET, for example, if that were really always true, the .Exists() methods wouldn't be in the API in the first place. Also consider scenarios where you expect your program to need to create the file. The first example that comes to mind is a desktop application. This application installs a default user-config file to its home directory, and the first time each user starts the application it copies this file to that user's application data folder. It expects the file not to exist on that first startup.
So when is it acceptable to check in advance for the existence (or other attributes, like size and permissions) of a file? Is expecting failure rather than success on the first attempt a good enough rule of thumb?

The File.Exists method exists primarily for testing for the existence of a file when you do not intend to open the file. For example, testing for the existence of a lock file whose very existence tells you something but whose contents are immaterial.
If you are going to open the file then you will need to handle any exception regardless of the results of any prior calls to File.Exists. So, in general, there is no real value in calling it in these circumstances. Just use the appropriate FileMode enumeration value in your open method and handle any exceptions, as simple as that.
EDIT: Even though this is couched in terms of the .NET API, it is based on the underlying system API. Both Windows and Unix have system calls (e.g. CreateFile on Windows) that take the equivalent of the FileMode enumeration. In fact, in .NET (or Mono) the FileMode value is just passed through to the underlying system call.
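For example, a minimal C# sketch of that approach: no Exists call, just an open with the intent expressed through FileMode and the failure handled where it occurs (the file name and the handling in each catch block are only placeholders):

using System;
using System.IO;

try
{
    // FileMode.Open fails if the file is missing; FileMode.OpenOrCreate,
    // CreateNew, Append, etc. express the other intents directly.
    using (FileStream fs = new FileStream("settings.xml", FileMode.Open, FileAccess.Read))
    {
        // ...read the file...
    }
}
catch (FileNotFoundException)
{
    // The file wasn't there at the moment we tried to open it.
}
catch (IOException)
{
    // Some other I/O failure (sharing violation, disk error, ...).
}
catch (UnauthorizedAccessException)
{
    // We found it but aren't allowed to read it.
}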

As a general policy, methods like File.Exists, or properties like WeakReference.Alive or SomeConcurrentQueue.Count are not useful as a means of ensuring that a "good" state exists, but can be useful as a means of determining that a "bad" state exists without doing any unnecessary (and possibly counterproductive) work. Such situations may arise in many scenarios involving locks (and files, since they often include locks). Because all routines that need to lock on a set of resources should, whenever practical, always acquire locks on those resources in a consistent order, it may be necessary to acquire a lock on one resource which is expected to exist before acquiring a resource which may or may not exist. In such a scenario, while it's impossible to avoid the possibility that one might lock the first resource, fail to acquire the second, and then release the first lock without having done any useful work with it, checking for the existence of the second resource before acquiring the lock on the first would minimize unnecessary and useless effort.

It depends on your requirements, but one way is to try to obtain an exclusive open file handle, with some sort of retry mechanism. Once you have that handle, it's going to be hard (or impossible) for another process to delete (or move) that file.
I've used code in .NET similar to the following to obtain an exclusive file handle, where I expect that some other process may be writing the file:
// Requires System.IO and System.Threading.
FileInfo fi = new FileInfo(fullFilePath);
FileStream fs = null;
int attempts = maxAttempts;
do
{
    try
    {
        // Asking to open for reading with exclusive access...
        fs = fi.Open(FileMode.Open, FileAccess.Read, FileShare.None);
    }
    // Ignore any errors (missing file, sharing violation, ...)...
    catch {}

    if (fs != null)
    {
        break;
    }
    else
    {
        // Another process still holds the file; wait briefly and retry.
        Thread.Sleep(100);
    }
}
while (--attempts > 0);

One example: You may be able to check for existence of files which you are unable to open (due to, for example, permissions).
Another, possibly better example: you want to check for the existence of a Unix device file. But definitely do not open it; opening it has side effects (e.g., opening and closing /dev/st0 will rewind the tape).

In a *nix environment, a well-established method for checking whether another copy of a program is already running is to create a lock file, so the check for file existence is used to verify this.
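A rough C# equivalent of the same idea lets the create operation itself be the test instead of a separate existence check; the lock-file path and the exit behaviour below are placeholders:

using System;
using System.IO;

FileStream lockFile = null;
try
{
    // FileMode.CreateNew fails if the file already exists, so creation
    // doubles as the "is another instance running?" check.
    lockFile = new FileStream("/tmp/myapp.lock", FileMode.CreateNew,
                              FileAccess.Write, FileShare.None);
}
catch (IOException)
{
    Console.Error.WriteLine("Another instance appears to be running.");
    Environment.Exit(1);
}
// ...run the application, then delete the lock file on clean shutdown...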

I'd only check it if I expect it to be missing (e.g. the application settings) and only if I have to read the file.
If I have to write to the file, it's either a logfile (so I can just append to it or create a new one) or I replace the contents of it, so I might as well recreate it anyway.
If I expect the file to exist, it is right for an exception to be thrown when it is missing. Exception handling should then inform the user or perform recovery. My opinion is that this results in cleaner code.
File protection (i.e. not overwriting possibly important files) is different; in that case I'd always check whether a file exists, if the framework doesn't do that for me (think SaveFileDialog).

I think the check makes sense when you want to be sure the file was there in the first place. As you said, settings files: if the file exists, I will try to merge the existing settings instead of blowing them away.
Other cases would be when a user tells me to do something with a file. Yes, I know the OpenFileDialog will check whether a file exists (but this is optional). I vaguely remember that back in VB6 this was not the case, so verifying that the file the user just told me to use actually existed was common.
I'd rather not program by exception.
Edit
I didn't miss the point. You might try to access the file, an exception is thrown, and then when you go to create the file, it has already been placed there, which now causes your exception-handling code to go on the fritz. So I guess we could then have an exception handler in our exception handler to catch that the file changed yet again...
I'd rather try to prevent exceptions, not use them to control logic.
Edit
Additionally, another time to check attributes such as size is when you're waiting for a file operation to finish. Yes, you never know for sure, but with a good algorithm, and depending on the system writing the file, you can handle a good deal of cases. (I had a system running for five years that watched for small files coming over FTP; it used the same API as FileSystemWatcher, then started polling, waiting for the file to stop changing, before raising an event that the file was ready to be consumed.)
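A stripped-down C# sketch of that polling idea (the poll interval and the "size unchanged since last poll" rule are arbitrary choices for illustration, not the original system's logic):

using System;
using System.IO;
using System.Threading;

// In the real system a FileSystemWatcher raised the initial "file arrived" event;
// polling then decides when the writer appears to be done.
static void WaitUntilStable(string path)
{
    long lastLength = -1;
    while (true)
    {
        long length;
        try
        {
            length = new FileInfo(path).Length;
        }
        catch (IOException)
        {
            length = -1;  // file missing or briefly inaccessible
        }

        if (length >= 0 && length == lastLength)
        {
            break;  // size hasn't changed since the last poll; assume the writer is done
        }
        lastLength = length;
        Thread.Sleep(1000);
    }
}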

This may be too simplistic, but I would think the primary reason for checking for the existence of a file (hence the existence of .Exists()) would be to prevent unintended overwrites of existing files, not to avoid exceptions caused by attempting to access non-existent or inaccessible files.
EDIT 2
This was, in fact, too simplistic and I recommend you see Stephen Martin's response.

If you're that concerned about somebody else removing the file, perhaps you should implement some sort of locking system. For instance, I used to work on the code for C-News, a Usenet news server. Since a lot of the things it did could happen asynchronously, it would "lock" a file or a directory by making a temp file, and then hard linking it to a file named "LOCK". If the link failed, it would mean that some other version of the program was writing to that directory, otherwise it was yours and you could do what you like.
The nifty thing about this is that most of the program was written in shell and awk, and this was a very portable locking mechanism. Also, the lock file would contain the PID of the owner, so you could look at the existing lock file to see if the owner was still running.

We have a diagnostic tool that has to gather a set of files, installer log included. Depending on different conditions the installer log can be in one of two folders. Even worse, there can be different versions of the log in both of these folders. How does the tool find the right one?
It's quite simple if you check for existence. If only one is present, grab that file. If both exist, find which has the later modification time and grab that one. That's just the normal way of doing things.
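For instance, a hedged sketch of that selection logic in C# (the two candidate paths are invented for the example):

using System;
using System.IO;

string[] candidates =
{
    @"C:\ProgramData\MyApp\install.log",
    @"C:\Users\Public\MyApp\install.log"
};

string newest = null;
DateTime newestTime = DateTime.MinValue;
foreach (string path in candidates)
{
    if (File.Exists(path))
    {
        DateTime modified = File.GetLastWriteTimeUtc(path);
        if (newest == null || modified > newestTime)
        {
            newest = path;
            newestTime = modified;
        }
    }
}
// newest is null if neither log was found.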

While this is a language-agnostic post, it seems you are talking about .NET. Most systems (.NET and others) have more detailed APIs in order to figure out if the file exists when opening the file.
What you should do is make a call to access the file, as it will typically indicate through some sort of error that the file doesn't exist (if it truly doesn't). In .NET, you would have to go through the P/Invoke layer and use the CreateFile API function. If that function returns an error of ERROR_FILE_NOT_FOUND, then you know that the file does not exist. If it returns successfully, then you have a handle that you can use.
The point here is that it is a somewhat atomic operation, which ultimately is what you are looking for.
Then, with the handle, you can pass it to a FileStream constructor and perform your work on the file.
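For the curious, here is roughly what that looks like; a sketch only, using the standard Win32 constants, with the error policy left as a placeholder:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class NativeOpen
{
    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 0x00000001;
    const uint OPEN_EXISTING = 3;
    const int ERROR_FILE_NOT_FOUND = 2;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern SafeFileHandle CreateFile(
        string fileName, uint desiredAccess, uint shareMode,
        IntPtr securityAttributes, uint creationDisposition,
        uint flagsAndAttributes, IntPtr templateFile);

    static FileStream TryOpen(string path)
    {
        SafeFileHandle handle = CreateFile(path, GENERIC_READ, FILE_SHARE_READ,
            IntPtr.Zero, OPEN_EXISTING, 0, IntPtr.Zero);
        if (handle.IsInvalid)
        {
            int error = Marshal.GetLastWin32Error();
            if (error == ERROR_FILE_NOT_FOUND)
                return null;               // the file did not exist at open time
            throw new IOException("CreateFile failed, error " + error);
        }
        return new FileStream(handle, FileAccess.Read);   // wrap the native handle
    }
}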

There are a number of possible applications you may well be writing for which a simple File.Exists is more than adequate for the job. If it's a config file that only your application will use, then you do not need to go overboard with your exception handling.
Whilst the "flaws" you have pointed out in using this method are all valid, it doesn't mean they are not acceptable flaws for some situations.

A variety of apps include built-in web servers. It's common for them to generate self-signed SSL certificates the first time they start up. A straightforward way to implement this would be to check whether the cert exists on startup, and create it if not.
In theory, it could exist for the check, and not exist later. In that case, we'd get an error when we try to listen, but that can be handled quite easily and is not a big deal.
It's also possible that it doesn't exist for the check, and exists later. In that case, it either gets overwritten with a new cert, or writing the new cert fails, depending on your policy. The first is a little annoying, in terms of the cert change causing some alarm, but also not really critical, especially if you do a bit of logging to indicate what is going on.
And, in practice, both cases are extraordinarily unlikely to ever come up.
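A sketch of that startup pattern; the certificate path and the GenerateSelfSignedCertificate helper are hypothetical, and only the shape of the check-and-create logic is the point:

using System.IO;

const string certPath = "server.pfx";

if (!File.Exists(certPath))
{
    try
    {
        GenerateSelfSignedCertificate(certPath);   // hypothetical helper
    }
    catch (IOException)
    {
        // Another instance may have created it between the check and the write;
        // fall through and try to use whatever is on disk now.
    }
}
// ...load certPath and start listening; failures here are reported normally...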

As you pointed out, what matters is what the program should do if the file is missing. In all my applications the user can always delete the config file, and the application will create a new one with default values. No problem. I also ship my applications without config files.
But users tend to delete files, even files they should not delete, like serial keys and template files. I always check for these files, because without them the application is unable to run at all. I cannot create a new serial key from defaults.
What should happen when the file is missing? You can do a file check or use an exception handler, but the real question is: what will happen when the file is missing, and how important is the file for the application? I check all the time before I try to access any support files for the app. Additionally, I do error handling if the file is corrupt and cannot be loaded.

I think anytime that you know that the file may or may not exist and you want to perform some alternate action based on the existence of the file, you should do the check because in this case it's not an exceptional condition for the file to not exist. This won't absolve you from having to handle exceptions -- from someone else either removing or creating the file between the check and your open -- but it makes the intent of the program clear and doesn't rely on exception handling to perform flow-control logic.
EDIT: An example might be log rotation on start up.
try
{
    if (File.Exists("app.log"))
    {
        RotateLogs();
    }
    log = File.Open("app.log", FileMode.CreateNew);
}
catch (IOException)
{
    // ...another writer, perhaps?
}
catch (UnauthorizedAccessException)
{
    // ...maybe I should have used runas?
}

To answer my own question (in part), I want to expand on the example I used: a default config file.
Rather than check if it exists at app startup and try to copy the file if the check fails, the thing to do is always try to copy the file. You just do it in such a way that the copy will fail if the file exists rather than replace an existing file. This way all you need to do is catch and ignore any exception thrown if the copy fails because of an existing file.
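In C# that boils down to something like the following sketch (the paths are placeholders, and it assumes the destination folder already exists):

using System;
using System.IO;

string source = @"C:\Program Files\MyApp\default.config";          // installed default
string target = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData),
    "MyApp", "user.config");                                        // per-user copy

try
{
    // Without the overwrite flag, File.Copy throws IOException if target exists,
    // so the copy attempt itself is the existence test.
    File.Copy(source, target);
}
catch (IOException)
{
    // The per-user config is already there (or appeared in the meantime): fine.
}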

Your problem could easily be solved with basic computer science... read up on Semaphores.
(I did not mean to sound like a jerk, I was just pointing you to a simple answer for a common problem).

I think the reason for "Exists" is to determine when files are missing without the need for creating all the OS housekeeping data required to access the file or having exceptions being thrown. So it's a file handling optimisation more than anything else.
For a single file, the saving that "Exists" gives is generally insignificant. If you were checking whether a file exists many, many times (for example, searching for #include files), then the saving could be significant.
In .Net, the specification for File.Exists doesn't list any exceptions that the method might throw, unlike for example File.Open which lists nine exceptions, so there's certainly less checking going on in the former.
Even if "Exists" returns true, you still need to handle exceptions when opening the file, as the .Net reference suggests.

Related

Is it ok to use open inline with json.dump [duplicate]

In Python, if you either open a file without calling close(), or close the file but not using try-finally or the "with" statement, is this a problem? Or does it suffice as a coding practice to rely on the Python garbage-collection to close all files? For example, if one does this:
for line in open("filename"):
    # ... do stuff ...
... is this a problem because the file can never be closed and an exception could occur that prevents it from being closed? Or will it definitely be closed at the conclusion of the for statement because the file goes out of scope?
In your example the file isn't guaranteed to be closed before the interpreter exits. In current versions of CPython the file will be closed at the end of the for loop because CPython uses reference counting as its primary garbage collection mechanism but that's an implementation detail, not a feature of the language. Other implementations of Python aren't guaranteed to work this way. For example IronPython, PyPy, and Jython don't use reference counting and therefore won't close the file at the end of the loop.
It's bad practice to rely on CPython's garbage collection implementation because it makes your code less portable. You might not have resource leaks if you use CPython, but if you ever switch to a Python implementation which doesn't use reference counting you'll need to go through all your code and make sure all your files are closed properly.
For your example use:
with open("filename") as f:
    for line in f:
        # ... do stuff ...
Some Pythons will close files automatically when they are no longer referenced, while others will not and it's up to the O/S to close files when the Python interpreter exits.
Even for the Pythons that will close files for you, the timing is not guaranteed: it could be immediately, or it could be seconds/minutes/hours/days later.
So, while you may not experience problems with the Python you are using, it is definitely not good practice to leave your files open. In fact, in cpython 3 you will now get warnings that the system had to close files for you if you didn't do it.
Moral: Clean up after yourself. :)
Although it is quite safe to use such a construct in this particular case, there are some caveats for generalising the practice:
you can potentially run out of file descriptors; it's unlikely, but imagine hunting a bug like that
you may not be able to delete said file on some systems, e.g. Win32
if you run anything other than CPython, you don't know when the file is closed for you
if you open the file in write or read-write mode, you don't know when the data is flushed
The file does get garbage collected, and hence closed. The GC determines when it gets closed, not you. Obviously this is not a recommended practice, because you might hit the open file handle limit if you do not close files as soon as you finish using them. What if, within that for loop of yours, you open more files and leave them lingering?
It is very important to close your file descriptor when you are going to use the file's content later in the same Python script. I realized this today after a long, hectic debugging session. The reason is that the content is only fully written and saved to the file once you close the file descriptor.
So suppose you write content to a new file and then, without closing the descriptor, use that file (not the descriptor) in a shell command that reads its content. In this situation the shell command will not see the contents you expect, and if you try to debug it you won't find the bug easily. You can also read more in my blog entry http://magnificentzps.blogspot.in/2014/04/importance-of-closing-file-descriptor.html
During the I/O process, data is buffered: this means that it is held in a temporary location before being written to the file.
Python doesn't flush the buffer—that is, write data to the file—until it's sure you're done writing. One way to do this is to close the file.
If you write to a file without closing, the data won't make it to the target file.
Python uses the close() method to close an opened file. Once the file is closed, you cannot read from or write to that file object again.
If you try to access the same file object again, it will raise ValueError since the file is already closed.
Python automatically closes a file if its reference object is reassigned to another file. Closing the file is standard practice, as it reduces the risk of the file being unwarrantedly modified.
Another way to solve this issue is the with statement.
If you open a file using a with statement, a temporary variable is reserved for accessing the file, and it can only be accessed within the indented block. The with statement itself calls the close() method after the indented code has executed.
Syntax:
with open('file_name.text') as file:
    # some code here

Open a File in D

If I want to safely try to open a file in D, is the preferred way to either
1. try to open it, and catch the exception (optionally figuring out why) if it fails, or
2. check that it exists and is readable, and only then open it?
I'm guessing the second alternative results in more IO and is more complex right?
If the file is expected to be there according to normal program operation and the given user input, then use 1 - just try to open the file and rely on exception handling to handle the exceptional situation that the file is not there.
For example:
/// If the user has a local configuration file in his home directory, open that.
/// Otherwise, open the global configuration file that is a part of the program,
/// and should be installed on all systems where the program is running.
File configFile;
if ("~/.transmogrifier.conf".expandTilde.exists)
    configFile.open("~/.transmogrifier.conf".expandTilde);
else
    configFile.open("/etc/transmogrifier.conf");
Note that using 2 might lead to security issues in your program. For example, if the file is present at the moment when your program checks if the file exists, but is gone when it tries to open it, your program may behave in an unexpected way. If you use 2, make sure that your program still behaves in a desirable way if opening the file fails even though your program just checked that the file exists and is readable.
Generally, it's better to check whether the file exists first, because it's often very likely that the file doesn't exist, and simply letting it fail when you try and open it is a case of using exceptions for flow control. It's also inefficient in the case where the file doesn't exist, because exceptions are quite expensive in D (though the cost of the I/O may still outweigh the cost of the exception given how expensive I/O is).
It's generally considered bad practice to use exceptions in cases where the exception is likely to be thrown. In those cases, it's far better to return whether the operation succeeded or to check whether the operation is likely to succeed prior to attempting the operation. In the case of opening files, you'd likely do the latter. So, the cleanest way to do what you're trying to do would be to do something like
if (filename.exists)
{
    auto file = File(filename);
    ...
}
or if you want to read the whole file in as a string in one go, you'd do
if (filename.exists)
{
    auto fileContents = readText(filename);
    ...
}
exists and readText are in std.file, and File is in std.stdio.
If you're dealing with a case where it's highly likely that the file will exist and that therefore it's very unlikely that an exception will be thrown, then skipping the check and just trying to open the file is fine. But what you want to avoid is relying on the exception being thrown when it's not unlikely that the operation will fail. You want exceptions to be thrown rarely, so you check that operations will succeed before attempting them if it's likely that they will fail and throw an exception. Otherwise, you end up using exceptions for flow control and harm the efficiency (and maintainability) of your program.
And it's often the case that a file won't be there when you try and open it, so it's usually the case that you should check that a file exists before trying to open it (but it does ultimately depend on your particular use case).
I'd say you need to be prepared for an exception to be thrown anyway, otherwise you have a race condition (another process may delete the file between the test and the open etc). So it's best just to go ahead and open, then deal with the contingency.

Mercurial command-line "API" reference?

I'm working on a Mercurial GUI client that interacts with hg.exe through the command line (the preferred high-level API, as I understand it).
However, I am having trouble determining the possible outputs of each command. I can see several outputs by simulating situations, but I was wondering if there is a complete reference of the possible outputs for each command.
For instance, for the command hg fetch, some possible outputs are:
pulling from https://User#server.com/Repo
searching for changes
no changes found
if there are no changes, or:
abort: outstanding uncommitted changes
or one of several other messages, depending on the situation.
I would like to structure my program to handle as many of these cases as possible, but it's hard for me to know in advance what they all are.
Is there a documented reference for the command-line? I have not been able to find one with The Google.
Look through the translation strings file. Then you'll know you have every message handled and be able to see what parts of it vary.
Also, fetch is just a convenience wrapper around pull/update/merge. If you're invoking Mercurial programmatically, you probably want to keep those three very different concepts separate when running it, so you know which part failed. In your example above it's the 'update' that's failing, so the 'pull' would have succeeded, and knowing the 'update' failed would allow you to provide the user with a better message.
(fetch is an abomination, which is part of why it's disabled by default)
Is this what you were looking for: https://www.mercurial-scm.org/wiki/MercurialBook ?
Mercurial 1.9 brings a command server: a stable (in the sense that the API doesn't change much) and low-overhead (there is no need to run an hg process for every command) way of interacting with Mercurial. The communication is done via a pipe.

Best way to handle a typical precondition exception?

Which of the following ways of handling this precondition is more desirable and what are the greater implications?
1:
If Not Exists(File) Then
    ThrowException
    Exit
End If
File.Open
...work on file...
2:
If Exists(File) Then
    File.Open
    ...work on file...
Else
    ThrowException
    Exit
End
Note: The File existence check is just an example of a precondition to HANDLE. Clearly, there is a good case for letting File existence checks throw their own exceptions upwards.
I prefer the first variant, as it better documents that there are preconditions.
Separating the precondition check from the work is only valid if nothing can change between the two. In this case an external event could delete the file before you open it. Hence the check for file existence has little value; the open call has to check this anyway, so let it produce the exception.
It's a style thing. Both work well; however, I prefer option 1. I like to exit my method as soon as I can and have all the checks up front.
The readability of the first approach is higher than that of the second.
The second option can nest quite fast if you have several preconditions to check; moreover, it suggests that the if/else is somehow part of the normal flow, while it is really only for exceptional situations.
Likewise, the expressiveness of the first approach is higher than that of the second.
As we are talking about preconditions, they should be checked in the beginning of the procedure, just to ensure the contract is being respected; for this reason, the entire check should be somehow separated from the remaining part of the procedure.
For these two reasons, I would definitely go for the first option.
Note: I am talking here about preconditions: I expect that the contract of your function explicitly defines the file as existing, and therefore not having it would be a sign of a programming error.
Otherwise, if we are simply talking about exception handling, I would simply leave it to the File.Open, handling that exception only if there is some idea on how to proceed with that.
Every exception must be produced at the appropriate level. In this case, your exception is an open() issue, which is handled by the open() call. Therefore, you should not add exception code to your routine, because you would duplicate stuff. This holds unless:
you want to abstract your IO backend (say your high-level routine can use plain file open today, but also MySQL in the future). In this case, it would be better for client code to know that a more standard and unique exception will be produced if IO issues arise
the presence of a low level exception implies a higher level exception with high level semantic (for example, not being able to open a password file means that no auth is possible and therefore you should raise something like UnableToAuthenticateException)
As for the coding style of your two cases, I would definitely go for the first. I hate long blocks of code, in particular under ifs. They also tend to nest, and if you choose the second strategy, you will end up indenting too much.
A true precondition is something which, if violated, indicates a bug on the caller's side: you designed the function under certain conditions, but they do not hold, so the caller should never have called the function with this data.
Your case of not finding a file could be like this, if the file is required and its existence is checked earlier in another part of the code; however, this is not quite so, as djna says: file deletion or network failure could cause an error to happen right when you open the file.
The most common treatment is then to try to open the file, and throw an exception on failure. Then, assuming that an exception hasn't been thrown, continue with normal work.

The use of config file is it equivalent to use of globals?

I've read many times, and agree, that the use of globals should be avoided to keep code orthogonal. Is the use of a config file to hold read-only information that your program uses similar to using globals?
If you're using config files in place of globals, then yes, they are similar.
Config files should only be used in cases where the end-user (presumably a computer-savvy user, like a developer) needs to declare settings for an application or piece of code, while keeping their hands out of the code itself.
My first reaction would be that it is not the same. I think the problem with globals is the read+write scenario. Config-files are readonly (at least in terms of execution).
In the same way, constants are not considered bad programming practice. Config files, at least in the way I use them, are just easily changeable constants.
Well, since a config file and a global variable can both have the effect of propagating changes throughout a system - they are roughly similar.
But... in the case of a configuration file that change is usually going to take place in a single, highly-visible (to the developer) location, and global variables can affect change in very sneaky and hard to track down ways -- so in this way the two concepts are not similar.
Having a configuration file usually helps with DRY concepts, and it shouldn't hurt the orthogonality of the system, either.
Bonus points for using the $25 word 'orthogonal'. I had to look that one up in Wikipedia to find out the non-Euclidean definition.
Configuration files are really meant to be easily editable by the end user as a way of telling the program how to run.
A more specialized form of configuration files, user preferences, are used to remember things between program executions.
A global refers to a unique instance of an object which will never change, whereas a config file is used as a container of reference values for objects within the application that can change.
The one "global" object will never change during runtime; the other object is initialized through a config file, but can change later on.
Actually, those objects not only can change during the lifetime of the application, they can also monitor the config file in order to support "hot changes" (modification of their value without stopping/restarting the application) if that config file is modified.
They are absolutely not the same, or replacements for each other. A config file, or config object, can be used non-globally, i.e. passed explicitly.
You can of course have a global variable that refers to a config object, and that would be defeating the purpose.
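To illustrate the difference, here is a small, hypothetical C# sketch: the values come from a file, but the resulting object is handed explicitly to the code that needs it rather than being reached through a global (the class names and the key=value format are invented):

using System.Collections.Generic;
using System.IO;

class AppSettings
{
    public Dictionary<string, string> Values { get; } = new Dictionary<string, string>();

    // Parse a simple key=value file into an in-memory settings object.
    public static AppSettings Load(string path)
    {
        var settings = new AppSettings();
        foreach (string line in File.ReadAllLines(path))
        {
            int eq = line.IndexOf('=');
            if (eq > 0)
                settings.Values[line.Substring(0, eq).Trim()] = line.Substring(eq + 1).Trim();
        }
        return settings;
    }
}

class ReportGenerator
{
    private readonly AppSettings settings;

    // The config object is passed in explicitly, not pulled from a global.
    public ReportGenerator(AppSettings settings) { this.settings = settings; }
}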