This monitor tracks the aggregated proportion of compute nodes, workstation nodes, and broker nodes in the cluster that can be reached. By default, this monitor enters the warning state when more than 20% of the nodes are unreachable, and it enters the critical state when more than 80% of the nodes are unreachable.
The health levels are defined as follows:
Healthy – The number of unreachable nodes is less than or equal to 20% of the total number of nodes.
Warning – The number of unreachable nodes is greater than 20% and less than or equal to 80% of the total number of nodes.
Critical – The number of unreachable nodes is greater than 80% of the total number of nodes.
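The thresholds above can be expressed as a small classification function. The following is a minimal sketch, not the monitor's actual implementation: the function name `classify_health` and its parameters are hypothetical, and the default thresholds are the 20% and 80% values stated above.

```python
def classify_health(unreachable: int, total: int,
                    warning_pct: float = 20.0,
                    critical_pct: float = 80.0) -> str:
    """Map a count of unreachable nodes to a health level.

    Hypothetical helper for illustration only. The comparisons are
    strict, so exactly 20% unreachable is still Healthy and exactly
    80% unreachable is still Warning.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of nodes")
    if not 0 <= unreachable <= total:
        raise ValueError("unreachable must be between 0 and total")

    # Compare unreachable/total against pct/100 by cross-multiplying,
    # which avoids rounding surprises at the exact boundaries.
    if unreachable * 100 > critical_pct * total:
        return "Critical"
    if unreachable * 100 > warning_pct * total:
        return "Warning"
    return "Healthy"
```

For example, in a cluster of 100 nodes, 20 unreachable nodes is still Healthy, 21 is Warning, and 81 is Critical.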
One or more nodes can be unreachable for the following reasons:
The HPC Node Manager Service is down on the node.
The name resolution has failed.
The node is disconnected from the network.
The node has been shut down.
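Of the causes above, failed name resolution is the easiest to check from another machine. The sketch below assumes you already have a list of node host names; the function name `find_unresolvable` and the `resolve` parameter are hypothetical and exist only for this illustration. It uses the standard library's `socket.getaddrinfo` and does not test the HPC Node Manager Service or network connectivity.

```python
import socket
from typing import Callable, Iterable, List


def find_unresolvable(hostnames: Iterable[str],
                      resolve: Callable = socket.getaddrinfo) -> List[str]:
    """Return the host names that fail name resolution.

    Hypothetical helper for illustration only. `resolve` defaults to
    socket.getaddrinfo and can be replaced by a stub when testing.
    """
    failed = []
    for name in hostnames:
        try:
            # Port None asks only for address lookup, not a service.
            resolve(name, None)
        except socket.gaierror:
            # Raised by getaddrinfo when the name cannot be resolved.
            failed.append(name)
    return failed
```

A node that resolves correctly can still be unreachable for one of the other reasons listed, so an empty result only rules out name resolution.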
To resolve this problem:
Check the status of the Node Connectivity monitor for the compute nodes, workstation nodes, and broker nodes in the cluster to identify and troubleshoot the unreachable nodes.