FAIL RECOVERY


In any client-server environment, submitting jobs to the server can fail.  The cause may be a network problem, a license problem, or simply a mistake in the client command line, such as a bad rule file or output destination.  Setting failhist=days in the client's uf100c.ini file makes the client retain the job input, command line, and error message whenever a job fails to run at the server.

 

When this parameter is set to a number greater than 0, the client stores failure history so that failed jobs can be resubmitted, optionally with command line options overridden to correct mistakes in the original command line.
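
To confirm that failure history is enabled, the configured value can be read back from the ini file.  The following is a minimal sketch, assuming the parameter appears on a plain failhist=N line; the function name failhist_days is hypothetical and not part of the client.

```shell
# Print the number of days configured on the first failhist= line of an ini file.
# Assumption: the parameter is written as a plain "failhist=N" line.
# Prints nothing if no such line is found.
failhist_days() {
    sed -n 's/^[[:space:]]*failhist[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}
```

For example, failhist_days uf100c.ini prints the configured number of days; a value of 0 or no output suggests that failure history is not being kept.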

 

Failed Job Storage

The queue for failed jobs is found in one of two places:

 

Linux clients store failure history in "failhist" under the client's home path (where uf100c.pl is located)
Windows clients store failure history in %ProgramData%\SDSI\uf100\failhist

 

Under this directory, history is stored by date in directories formatted as yyyymmdd.  Each failed job is stored with three files.  A .in file contains the job's print stream, a .arg file contains the job's command line arguments, and a .err file contains the error message pertaining to that job's failure.
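
The layout above makes it straightforward to review one day's failures from a Linux shell.  The following is a sketch only: it assumes each .err file is plain text with the message on its first line, and the function name list_failed_jobs is hypothetical.

```shell
# List the failed jobs stored for one date, with each job's error message.
# Usage: list_failed_jobs FAILHIST_DIR YYYYMMDD
# Assumption: the first line of each .err file holds the error message.
list_failed_jobs() {
    dir="$1/$2"
    for err in "$dir"/*.err; do
        [ -e "$err" ] || continue        # no .err files stored for this date
        job=$(basename "$err" .err)      # strip the directory and the extension
        printf '%s: %s\n' "$job" "$(head -n 1 "$err")"
    done
}
```

Calling list_failed_jobs failhist 20240115 would print one line per failed job stored for that date (the date here is only an illustration).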

 

Failed job history is maintained for the number of days specified in the failhist parameter of uf100c.ini.  Date directories in the failhist directory that are older than that number of days are removed as jobs are run.

 

Recovery

To re-run one or more jobs, use the command line client (uf100c on Linux or uf100cc.exe on Windows) with the -rerun option.  Arguments before the -rerun option override any found in the failed job's arguments file, so you can use them to fix problems with the original command line.  For example, if jobs failed because of a bad rule file name, you can specify a new rule file with -f correctname.rul in the command line before the -rerun option.
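
The override pattern described above can be wrapped in a small shell function so the command line can be previewed before anything is resubmitted.  This is a sketch under stated assumptions: the UF100C variable and the function name rerun_with_rule are hypothetical conveniences, not features of the client, and the job file name in the usage note is only an illustration.

```shell
# Resubmit failed jobs with a corrected rule file placed before -rerun.
# Usage: rerun_with_rule RULE_FILE FAILED_JOB_FILE...
# UF100C (hypothetical) names the client command; set UF100C="echo uf100c"
# to print the command line instead of running it.
rerun_with_rule() {
    rule="$1"
    shift
    # $UF100C is deliberately unquoted so that "echo uf100c" splits into words.
    ${UF100C:-uf100c} -f "$rule" -rerun "$@"
}
```

With UF100C="echo uf100c" set, rerun_with_rule correctname.rul failhist/20240115/job1.err only prints the command that would be run; unset the variable to submit the job.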

 

The -rerun option should be followed by one or more file names from the failhist directory structure.  These can be plain file names or full path names, and they can carry any of the .in, .arg, or .err extensions; the re-run process normalizes the names.  There are three special cases:

 

If no names follow -rerun, the standard input is read for a list of files to process.  This makes it easy to use script routines to provide the list of names.  For example, on Linux the grep command can be used to get a list of files containing a certain message, and the list passed to the uf100c command line:
 
grep --files-with-matches 'Bad rule file' failhist/*/*.err | uf100c -f myrule.rul -rerun
 
Use -rerun all to re-submit all files in all failhist directories.  This could be used, for example, to recover from a server outage.
 
Use -rerun yyyymmdd to process all failhist files for the specified date.
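
Before choosing between naming individual files, -rerun yyyymmdd, and -rerun all, it can help to see how many failed jobs each date holds.  The following sketch counts the .err files in each date directory, relying on the one-.err-file-per-failed-job layout described under Failed Job Storage; the function name count_failures_by_date is hypothetical.

```shell
# Print "yyyymmdd count" for every date directory under a failhist directory.
# Usage: count_failures_by_date FAILHIST_DIR
count_failures_by_date() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue          # failhist directory is empty or missing
        n=0
        for f in "$d"*.err; do
            if [ -e "$f" ]; then         # skip the unexpanded pattern when no .err exists
                n=$((n + 1))
            fi
        done
        printf '%s %s\n' "$(basename "$d")" "$n"
    done
}
```

A date with a large count is a natural candidate for -rerun yyyymmdd, while failures spread across many dates may point to -rerun all.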