• Management Pack:  HPC Server
  • MP Version:  3.1.3266.0 for HPC Server 2008 R2
  • Released:  2/14/2011
  • Publisher:  Microsoft

Failed Job Proportion Monitor

  • ID:  Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs
  • Description:  Failed job proportion performance monitor for HPC 2008 R2 Job Scheduler
  • Target:  HPC 2008 R2 Job Scheduler
  • Enabled:  Yes

Operational States

Name     State     Description
Low      Success
Medium   Warning
High     Error

Overridable Parameters

Parameter Name     Default Value   Description   Override
Low Threshold      20
High Threshold     70
Timeout Seconds    300
Interval Seconds   900
Sync Time

Alert Details

Monitor State   Message                                                   Priority   Severity   Auto Resolution
High (Error)    Failed Jobs Proportion has exceeded the upper threshold   Medium     Critical   Yes

Run As Profiles

Name
HPC Server Admin Action Account

Monitor Knowledgebase

Summary

This monitor tracks the percentage of failed jobs out of the total number of finished jobs. Finished jobs are those in the Finished, Canceled, or Failed state. A large percentage of failed jobs may indicate that the health of the HPC Job Scheduler Service is at a warning or critical level.

The health levels are defined as follows:

  • Healthy – The number of failed jobs is less than or equal to 20% of the total number of finished jobs.

  • Warning – The number of failed jobs is greater than 20% and less than or equal to 70% of the total number of finished jobs.

  • Critical – The number of failed jobs is greater than 70% of the total number of finished jobs.
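
The threshold logic above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the management pack's actual implementation: the function and enum names are hypothetical, the defaults of 20 and 70 come from the Low Threshold and High Threshold parameters listed earlier, and treating a cluster with no finished jobs as healthy is an assumption.

```python
from enum import Enum


class Health(Enum):
    """Health levels as named in the summary (hypothetical enum)."""
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


# Default values of the monitor's overridable threshold parameters.
LOW_THRESHOLD = 20.0
HIGH_THRESHOLD = 70.0


def failed_job_percentage(finished: int, canceled: int, failed: int) -> float:
    """Percentage of failed jobs among all jobs in a terminal state.

    The total counts jobs in the Finished, Canceled, and Failed states.
    """
    total = finished + canceled + failed
    if total == 0:
        # Assumption: with no finished jobs there is nothing to report.
        return 0.0
    return 100.0 * failed / total


def classify(percentage: float,
             low: float = LOW_THRESHOLD,
             high: float = HIGH_THRESHOLD) -> Health:
    """Map a failed-job percentage to a health level.

    Both boundaries are inclusive on the lower level: exactly 20% is
    Healthy and exactly 70% is Warning.
    """
    if percentage <= low:
        return Health.HEALTHY
    if percentage <= high:
        return Health.WARNING
    return Health.CRITICAL
```

For example, 25 failed jobs out of 100 finished jobs gives 25%, which falls in the Warning range with the default thresholds. Raising the Low Threshold override above 25 would make the same sample Healthy.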

Causes

Failed jobs can be caused by any of the following:

  • Application failures. These failures are indicated by applications that return a non-zero exit code, and could have a variety of causes.

  • Node failures.

  • Network failures.

  • Storage failures (often caused by network failures).

  • Submission errors, such as bad file or directory names for tasks.

Resolutions

To troubleshoot and fix this problem:

  • Determine why the job failed. The reason for a job failure can be found by using HPC Cluster Manager.

  • If the job failed because one or more of its tasks failed, check the output of the failed tasks for the cause.

  • If the job failed because of a node failure, check that your nodes are online and that you have network connectivity to them.

  • Check the health state of the nodes that the failed jobs ran on. Click the State view in the Compute Node folder and locate those nodes.

External References
This monitor does not contain any external references.
