This monitor tracks the percentage of failed jobs out of the total number of finished jobs. Finished jobs are jobs in the Finished, Canceled, or Failed state. A large percentage of failed jobs may indicate that the health of the HPC Job Scheduler Service is at a warning or critical level.
The health levels are defined as follows:
Healthy – The number of failed jobs is less than or equal to 20% of the total number of finished jobs.
Warning – The number of failed jobs is greater than 20% and less than or equal to 70% of the total number of finished jobs.
Critical – The number of failed jobs is greater than 70% of the total number of finished jobs.
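The thresholds above can be sketched as a small function. This is only an illustration of the arithmetic, not the monitor's actual implementation: the function name health_state and its parameters are hypothetical, and treating an empty set of finished jobs as Healthy is an assumption, because this article does not say how that case is handled.

```python
def health_state(finished: int, canceled: int, failed: int) -> str:
    """Classify scheduler health from job counts (illustrative sketch only).

    The three arguments are the numbers of jobs in the Finished, Canceled,
    and Failed states; together they form the "total finished jobs".
    """
    total = finished + canceled + failed
    if total == 0:
        # Assumption: with no finished jobs there is nothing to report,
        # so the state is treated as Healthy.
        return "Healthy"

    # Integer comparison avoids floating-point rounding at the boundaries:
    # failed / total <= 20%  is the same as  failed * 100 <= 20 * total.
    if failed * 100 <= 20 * total:
        return "Healthy"
    if failed * 100 <= 70 * total:
        return "Warning"
    return "Critical"
```

For example, 70 Finished, 10 Canceled, and 20 Failed jobs give 20 failed out of 100 finished jobs, which is exactly 20% and therefore still Healthy. One more failed job in place of a finished one moves the state to Warning.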
Failed jobs can be caused by any of the following:
Application failures. These failures are indicated by applications that return a non-zero exit code, and could have a variety of causes.
Storage failures (often caused by network failures).
Submission errors, such as bad file or directory names for tasks.
To troubleshoot and fix this problem:
Determine why the job failed. You can find the reason for the failure by using HPC Cluster Manager.
If the job failed because one or more of its tasks failed, check the output of the failed tasks for the reason for the failure.
If the job failed because of a node failure, check that your nodes are online and that you have network connectivity to them.
Check the health state of the nodes that the failed jobs ran on. In the Compute Node folder, click the State view and check those nodes.