Saturday, August 15, 2009

Return value from system() is not reliable

Recently I happened to use system(), a nifty function available on the Linux platform, to perform some maintenance work, a bit of housekeeping, before my application starts up. Why this was needed is not really important. Essentially, the environment had to be configured before my application could start up and function properly. Ideally this should have been done by a configuration/setup script, probably written in Python or a similar language meant for such tasks. But unfortunately I did not have that privilege and I had to do every bit of it from my C program. The tasks were simple and very routine, like clearing a workspace directory, setting appropriate permissions and the like. The initial thought was to use the dirent family of functions, together with the related system calls, to read the filesystem and modify it programmatically. But doing that whole thing was a big PITA. Hence I took the easy route and simply used the system() function, which executes shell commands.

The problem with this easy approach is that tracking the operation's success is really hard. My reading of the system() man page was that the function returns the actual value returned by the command that we pass to it. But sadly this is not how things are, at least on the Linux 2.6 machine on which I am developing and running my code. The return value from system() is highly unreliable. In fact the man page says as much, but in a very subtle way. Here is a quote from the man page:
     The system() function returns the exit status of the shell as returned by
     waitpid(2), or -1 if an error occurred when invoking fork(2) or
     waitpid(2). A return value of 127 means the execution of the shell
     failed.
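
One detail in that quote is easy to miss: the value is a wait status "as returned by waitpid(2)", not the command's exit code itself, so it has to be decoded with the macros from <sys/wait.h>. Here is a minimal sketch of how I understand that decoding. The helper name run_command is made up for illustration; it is not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Hypothetical helper (name is an assumption, not a library function):
 * runs a shell command and returns its exit code (0-255), or -1 if the
 * exit code could not be determined. */
int run_command(const char *cmd)
{
    assert(cmd != NULL);

    int status = system(cmd);

    if (status == -1) {
        /* fork() or waitpid() failed; errno describes the failure. */
        fprintf(stderr, "system(\"%s\") failed: %s\n", cmd, strerror(errno));
        return -1;
    }
    if (WIFEXITED(status)) {
        /* The shell exited normally; extract its exit code. */
        return WEXITSTATUS(status);
    }
    if (WIFSIGNALED(status)) {
        /* The shell was terminated by a signal. */
        fprintf(stderr, "\"%s\" killed by signal %d\n", cmd, WTERMSIG(status));
    }
    return -1;
}
```

With this, run_command("exit 3") should give 3, and a -1 at least comes with an errno message saying what went wrong.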


I am not fully clear on the process semantics, but here is what I observed. When I execute a command using system(), the command executes successfully, as in, the corresponding operation is completed, and yet the return value is -1, telling me that the execution of the command failed somewhere without telling me where. For example, if it is a command to clear a directory and create some other directory structure there, all these tasks are completed. The old directory is gone and the new ones are created. I can see that when I navigate to that location from the command line, but system() has still returned -1. Initially I was checking the return value to handle failures and was taking some fail-safe steps. But those steps were being taken even when everything was fine. The logs repeatedly told me that the operations were failing, whereas in reality they had all succeeded.

The reason for this is not known to me. It probably lies in the quote from the man page above. As it says, -1 can be returned for either kind of error, be it an error from fork or from wait. If the error was from wait, which I am guessing is the case, it makes a little sense. I think the fork and exec went through properly and the command performed the required action without any error. But later the wait failed, so system() could not collect the exit status and hence returned -1. This is the only thing that I can think of. Nevertheless, the bottom line is: the return value from system() is not reliable.
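
One known way for the wait to fail even though the command ran fine is when the calling process ignores SIGCHLD. On Linux, the child is then reaped automatically, waitpid() fails with ECHILD, and system() returns -1. Whether that was the cause in my case is only an assumption. The sketch below reproduces the symptom under that assumption; the function name system_with_sigchld_ignored is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>

/* Hypothetical demo (name is an assumption): runs cmd through system()
 * while SIGCHLD is ignored. Returns whatever system() returned and stores
 * the errno value seen right after the call in *saved_errno. */
int system_with_sigchld_ignored(const char *cmd, int *saved_errno)
{
    assert(cmd != NULL && saved_errno != NULL);

    /* Ignore SIGCHLD, remembering the previous disposition. */
    void (*old_handler)(int) = signal(SIGCHLD, SIG_IGN);

    errno = 0;
    int status = system(cmd);   /* the command itself still runs */
    *saved_errno = errno;       /* expected to be ECHILD on Linux */

    /* Restore the previous disposition (skip if signal() itself failed). */
    if (old_handler != SIG_ERR) {
        signal(SIGCHLD, old_handler);
    }
    return status;
}
```

If this is what is going on, checking errno whenever system() returns -1 tells a failed wait (ECHILD) apart from a failed fork.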

