Optimizing Performance with the Server Cluster Recovery Utility

How to Use the Server Cluster Recovery Utility for Fast Failover

Purpose

Quickly restore clustered services and minimize downtime by using the Server Cluster Recovery Utility (SCRU) to detect failures, recover nodes, and trigger fast failover.

Prerequisites

  • All cluster nodes reachable via management network and SSH/WinRM.
  • Valid backups of cluster configuration and critical data.
  • SCRU installed on a management host with credentials for cluster nodes.
  • Quorum/witness configured and known.

Quick checklist (order of operations)

  1. Assess cluster health
    • Run SCRU discovery/health command to list node statuses and quorum state.
  2. Isolate failed node(s)
    • Mark unhealthy nodes as maintenance/drain to prevent split-brain (scru node set-maintenance).
  3. Restore quorum if needed
    • If quorum is lost, bring the witness online or reassign votes to reach a majority (scru quorum repair).
  4. Recover or replace node
    • For a recoverable node: run scru node repair (checks services, mounts, network, storage).
    • For an unrecoverable node: remove it from the cluster and add a rebuilt node using SCRU node-replace.
  5. Failover services
    • Trigger a controlled failover of clustered roles to healthy node(s) with scru failover --graceful.
    • If a rapid switch is required, use scru failover --force (only if the graceful failover fails).
  6. Verify services and data
    • Run SCRU verify to confirm resources online, disk mounts intact, and replication healthy.
  7. Post-recovery hardening
    • Reintroduce repaired nodes with SCRU rejoin, rebalance ownership, and restore votes.
    • Run full cluster validation and schedule follow-up backup.
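
The checklist above can be sketched as a small dry-run runbook. This is only an illustration: the subcommands and flags are copied from this article's example syntax and have not been checked against a real SCRU release, the combination of --graceful with --resource and --target is an assumption, and the cluster, node, and resource names are placeholders. The quorum-repair and node-replace steps are left out because they depend on the situation.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the recovery checklist.
# ASSUMPTION: scru subcommands/flags follow this article's example syntax.

CLUSTER="${CLUSTER:-mycluster}"        # placeholder cluster name
FAILED_NODE="${FAILED_NODE:-node1}"    # placeholder: node being recovered
TARGET_NODE="${TARGET_NODE:-node3}"    # placeholder: healthy failover target
RESOURCE="${RESOURCE:-web-service}"    # placeholder: clustered role to move
DRY_RUN="${DRY_RUN:-1}"                # 1 = only print commands (default)

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Steps 1, 2, 4, 5, 6 of the checklist; stops at the first failing step.
recovery_runbook() {
  run scru status --cluster "$CLUSTER" || return 1
  run scru node set-maintenance --node "$FAILED_NODE" --reason "hardware fault" || return 1
  run scru node repair --node "$FAILED_NODE" --checks network,services,storage || return 1
  run scru failover --resource "$RESOURCE" --target "$TARGET_NODE" --graceful || return 1
  run scru verify --cluster "$CLUSTER" --level full || return 1
}
```

Calling recovery_runbook with the default DRY_RUN=1 prints the five commands without touching the cluster, which makes the order easy to review before setting DRY_RUN=0.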

Common SCRU commands (example syntax)

  • Discover/health:

    scru status --cluster mycluster

  • Set maintenance:

    scru node set-maintenance --node node1 --reason "hardware fault"

  • Quorum repair:

    scru quorum repair --cluster mycluster --witness add://fileshare/path

  • Node repair:

    scru node repair --node node2 --checks network,services,storage

  • Force failover:

    scru failover --resource web-service --target node3 --force

  • Verify:

    scru verify --cluster mycluster --level full

Fast-failover best practices

  • Enable dynamic quorum and automatic witness where supported.
  • Keep automated health checks and preflight validation scripts active.
  • Use graceful failover by default; reserve --force for emergencies.
  • Maintain recent configuration backups and tested rebuild playbooks.
  • Test failover and full recovery in staging quarterly.
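
One way to enforce the graceful-by-default practice is a small wrapper that only escalates to --force when an operator has explicitly allowed it. The sketch below assumes that scru exits non-zero when a failover attempt fails, which this article does not confirm; the function name and the ALLOW_FORCE variable are hypothetical.

```shell
#!/usr/bin/env bash
# Graceful-first failover wrapper (illustrative sketch).
# ASSUMPTION: scru returns a non-zero exit status when a failover fails.
# ALLOW_FORCE and failover_graceful_first are hypothetical names.

failover_graceful_first() {
  local resource="$1" target="$2"

  # Always try the controlled path first.
  if scru failover --resource "$resource" --target "$target" --graceful; then
    return 0
  fi

  # Refuse to force unless the operator opted in for this run.
  if [ "${ALLOW_FORCE:-0}" != "1" ]; then
    echo "graceful failover failed; set ALLOW_FORCE=1 to permit --force" >&2
    return 1
  fi

  echo "graceful failover failed; retrying with --force" >&2
  scru failover --resource "$resource" --target "$target" --force
}
```

Invoked as failover_graceful_first web-service node3, the wrapper stops after a failed graceful attempt; prefixing the call with ALLOW_FORCE=1 permits the forced retry.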

Troubleshooting tips

  • If cluster won’t start after quorum fix, inspect quorum log and evict stale node IDs.
  • For split-brain, prefer restoring the majority partition and re-sync data from authoritative nodes.
  • If shared storage shows inconsistent ownership, run SCRU storage-repair with snapshots disabled.
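
For the split-brain case, deciding which partition is the majority comes down to simple vote arithmetic: a partition keeps quorum only if it holds more than half of all configured votes. The helpers below sketch that general strict-majority rule; the function names are hypothetical, and SCRU's own vote accounting (for example under dynamic quorum) may differ.

```shell
#!/usr/bin/env bash
# Strict-majority quorum arithmetic (general rule, not SCRU-specific).

# votes_needed TOTAL: smallest vote count that is more than half of TOTAL.
votes_needed() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum PARTITION_VOTES TOTAL_VOTES: succeeds if the partition holds a majority.
has_quorum() {
  [ "$1" -ge "$(votes_needed "$2")" ]
}
```

With four nodes plus a witness (five votes), a partition needs three votes, so has_quorum 3 5 succeeds. With four votes and no witness, an even 2/2 split leaves neither side with quorum, which is why a witness is listed in the prerequisites.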
