Microsoft Cluster Services (MSCS) and Failover Clustering (FC) are the clustered high-availability solutions for Server 2003 and Server 2008, respectively. Collectively, they’re commonly referred to by the name “cluster,” and they’re often used to create failover strategies with minimal downtime between servers in the same physical location. Microsoft has extended clustering across the WAN with Distributed Failover Clustering, but so far it’s only available for Exchange 2007 in the form of CCR on Server 2008 in a 2-node, Active/Passive config. If you need to replicate and fail over between sites, between clusters, or both, Double-Take Availability offers a great way to let multi-node clusters do so to other machines in the same or different locations.
One of the most frequent questions we get about failing over clusters is why we here at Double-Take Software recommend that you only fail over the entire cluster at once, as opposed to moving individual resource groups if they fail on one or more nodes. The answer has much more to do with why you’d have a failover condition than with how the mechanics work.
When a resource or group fails in a cluster, another node of the cluster designated as a possible owner will automatically take over. So unless multiple nodes, multiple resources, or both fail, there is no reason to move to the other hardware in the disaster recovery (DR) site. If a resource group fails on one node, then goes on to fail on other nodes, there is a serious issue with the cluster as a whole, and the still-running resource groups are now suspect themselves. At that point, it would be better to fail over the entire cluster and troubleshoot the problem on the original servers.
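To make that reasoning concrete, here is a rough sketch of the decision in Python. It is only an illustration of the logic described above, not how Double-Take Availability or the Windows cluster service is actually implemented. The `GroupStatus` type and `recommend_action` function are hypothetical names, and treating "failed on every possible owner" as the trigger for a full failover is our simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class GroupStatus:
    """Hypothetical snapshot of one resource group in the source cluster."""
    name: str
    possible_owners: Set[str]                          # nodes allowed to host the group
    failed_on: Set[str] = field(default_factory=set)   # nodes where it has already failed


def recommend_action(groups: List[GroupStatus]) -> str:
    """Suggest an action based on where resource groups have failed."""
    for group in groups:
        remaining = group.possible_owners - group.failed_on
        if group.failed_on and not remaining:
            # Assumption: a group that has failed on every possible owner
            # means the whole cluster is suspect, so move everything.
            return "fail over entire cluster"
    if any(group.failed_on for group in groups):
        # Another possible owner can still take over inside the cluster.
        return "local failover within cluster"
    return "no action"
```

For example, a group that has failed on one of two possible owners yields "local failover within cluster", while the same group failing on both owners yields "fail over entire cluster", even if other groups are still running.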
So, with a few exceptions, our recommendation to fail over either the entire cluster or nothing is based on how clusters work. We do permit failing over one resource group to another cluster, provided that you manually configure the failover systems to handle that, and that the application in question supports it. File resource groups are generally not a challenge for single-group failover. Exchange resource groups (2003 or 2007), on the other hand, cannot move independently. This is because the groups of an Exchange cluster are designed to work as one logical unit, and therefore shouldn’t be split between clusters if they were built on a single, contiguous cluster.
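The exceptions above boil down to a simple check, sketched below in Python. Again, this is a hypothetical illustration rather than anything shipped in the product: the `Workload` enum and `single_group_failover_allowed` function are names we made up, and the rules simply restate the conditions in the preceding paragraph.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload categories for a resource group."""
    FILE = "file"
    EXCHANGE = "exchange"
    OTHER = "other"


def single_group_failover_allowed(
    workload: Workload,
    manually_configured: bool,
    app_supports_it: bool = True,
) -> bool:
    """Return True if moving one resource group to another cluster is reasonable."""
    if workload is Workload.EXCHANGE:
        # Exchange groups work as one logical unit and should not be split.
        return False
    # Any other group needs manual failover configuration and application support.
    return manually_configured and app_supports_it
```

A file resource group with failover manually configured passes the check, while an Exchange group never does, regardless of configuration.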
Moving all resource groups for a cluster at once makes good sense, based on how clusters function and how you’d normally enter into a failover condition. There are exceptions to every rule, but when you look at all the variables, complete cluster failover is best.