Traditional high-availability (HA) NVMe installations for parallel file systems are built on Storage Bridge Bay (SBB) systems, which combine two server nodes with multiple dual-ported NVMe drives connected to both nodes simultaneously.
This architecture has its advantages but comes with one main challenge: the proper sizing of each server node. The goal of implementing HA with SBB is to survive a node failure by moving all tasks from the failed node to the surviving one.
If each node has been sized for the expected workload in normal operation, then after a failover the surviving node has to carry both workloads and its performance degrades, slowing down the file system; if each node has been sized to run all the components of the cluster simultaneously, the cost of the solution increases significantly.
A more advanced approach that supports high availability and solves the sizing challenge is to build an architecture based on N active nodes and M idle nodes, with M significantly smaller than N.
Such a cluster could survive any component failure.
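The sizing trade-off can be illustrated with a rough back-of-the-envelope model. The sketch below is only an illustration under simplifying assumptions that are not taken from any measurement: every active node carries one unit of workload, cost scales linearly with provisioned compute capacity, throughput scales linearly with available capacity, and any idle node can take over the role of any failed node. The names `Sizing`, `sbb_sized_for_normal`, `sbb_sized_for_failover` and `n_plus_m` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sizing:
    # Total compute capacity provisioned, in units of one node's normal workload.
    provisioned_units: float
    # Extra capacity bought on top of the normal workload (0.25 means +25%).
    overhead: float
    # Fraction of normal throughput delivered to the affected workloads
    # after a single node failure (assumes linear scaling with capacity).
    throughput_after_failure: float


def sbb_sized_for_normal(n_nodes: int) -> Sizing:
    """SBB pairs where each node is sized only for its own workload.

    After a failover the survivor carries two workloads with one unit
    of capacity, so each affected workload gets half of it.
    """
    return Sizing(float(n_nodes), 0.0, 0.5)


def sbb_sized_for_failover(n_nodes: int) -> Sizing:
    """SBB pairs where each node is sized to carry both workloads of the pair."""
    return Sizing(2.0 * n_nodes, 1.0, 1.0)


def n_plus_m(n_active: int, m_idle: int) -> Sizing:
    """N active nodes plus M idle spares, each sized for one workload."""
    if n_active <= 0 or m_idle < 1:
        raise ValueError("need at least one active node and one idle node")
    return Sizing(float(n_active + m_idle), m_idle / n_active, 1.0)


if __name__ == "__main__":
    # Hypothetical cluster: 8 workloads, compared across the three sizing options.
    for name, sizing in [
        ("SBB, sized for normal load", sbb_sized_for_normal(8)),
        ("SBB, sized for failover", sbb_sized_for_failover(8)),
        ("N+M with N=8, M=2", n_plus_m(8, 2)),
    ]:
        print(name, sizing)
```

Under these assumptions, an SBB pair either loses half of its throughput on failover or doubles the provisioned capacity, while the N+M layout keeps full throughput for an overhead of M/N, as long as no more than M nodes fail at the same time.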
To demonstrate the validity of this architecture, we’ll present several proofs of concept and real industrial implementations.

