Wednesday, May 18, 2011

Red Hat Linux Clusters: Name Resolution and Split-Brain Issues

How Broadcast Signaling Works
To put this issue in perspective, it helps to first understand how broadcast signaling works. When a member starts, it issues a broadcast message to the network, using its cluster name as an identifier. If a cluster with the same cluster name is present on the network, that cluster is expected to reply to the broadcast message with its node name. The joining member then sends a join request to the cluster. By default, the cluster uses port 5405 for broadcast messages.
You can see these messages by capturing traffic on the NIC in promiscuous mode with either Wireshark or tcpdump (tcpdump prints port 5405 by its /etc/services name, netsupport):

22:25:36.455369 IP server-01.5149 > server-02.netsupport: UDP, length 106
22:25:36.455531 IP server-02.5149 > server-01.netsupport: UDP, length 106
22:25:36.665363 IP server-01.5149 > UDP, length 118
22:25:36.852367 IP server-01.5149 > server-02.netsupport: UDP, length 106
22:25:36.852526 IP server-02.5149 > server-01.netsupport: UDP, length 106
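
The join handshake described above can be sketched as a small in-memory model. This is only a toy illustration, not the actual wire protocol: the Cluster class, its method names, and the join function are hypothetical, and no real network traffic is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of a running cluster (hypothetical, for illustration only)."""
    name: str                       # cluster name used as the identifier
    node_name: str                  # node that answers broadcast messages
    members: list = field(default_factory=list)

    def on_broadcast(self, cluster_name):
        # Reply with our node name only if the cluster name matches.
        return self.node_name if cluster_name == self.name else None

    def on_join_request(self, member):
        self.members.append(member)


def join(member, cluster_name, network):
    """Broadcast cluster_name to every cluster on the (simulated) network.

    Returns the node name that replied, or None if no cluster answered.
    """
    for cluster in network:
        reply = cluster.on_broadcast(cluster_name)
        if reply is not None:
            # A cluster with the same name exists: send it a join request.
            cluster.on_join_request(member)
            return reply
    return None


network = [Cluster("prod", "server-01"), Cluster("test", "server-09")]
replied_by = join("server-02", "prod", network)
```

In this model a member that broadcasts a cluster name nobody answers to simply gets None back and joins nothing.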

How Name resolution works and cluster.conf

Cman (the Cluster Manager) tries hard to match the local host name(s) to those mentioned in cluster.conf. Here's how it does it:

1. It looks up $HOSTNAME in cluster.conf.
2. If this fails, it strips the domain name from $HOSTNAME and looks that up in cluster.conf.
3. If this fails, it looks in cluster.conf for a fully-qualified name whose short version matches the short version of $HOSTNAME.
4. If all this fails, it searches the interfaces list for an (IPv4 only) address that matches a name in cluster.conf.
cman will then bind to the address that it has matched.
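
The four lookup steps above can be sketched in a few lines of Python. This is a minimal model and not cman's actual code: the function name, the local_addrs set, and the resolve dictionary (a stand-in for a real resolver lookup) are all assumptions made for the example.

```python
def match_node_name(hostname, conf_names, local_addrs, resolve):
    """Return (matched_name, step), or (None, None) if nothing matches.

    hostname    -- the local host name, as in $HOSTNAME
    conf_names  -- node names listed in cluster.conf
    local_addrs -- set of IPv4 addresses on the local interfaces
    resolve     -- dict mapping a name to its IPv4 address
                   (hypothetical stand-in for a resolver lookup)
    """
    # Step 1: exact match on $HOSTNAME.
    if hostname in conf_names:
        return hostname, 1

    # Step 2: strip the domain name and try the short name.
    short = hostname.split(".", 1)[0]
    if short in conf_names:
        return short, 2

    # Step 3: a fully-qualified name in cluster.conf whose short
    # version equals the short version of $HOSTNAME.
    for name in conf_names:
        if "." in name and name.split(".", 1)[0] == short:
            return name, 3

    # Step 4: a name in cluster.conf that resolves to an IPv4
    # address configured on one of the local interfaces.
    for name in conf_names:
        if resolve.get(name) in local_addrs:
            return name, 4

    return None, None


names = ["node1.example.com", "node2.example.com"]
result = match_node_name("node1", names, set(), {})
```

Here a host whose $HOSTNAME is the short name node1 matches the fully-qualified entry at step 3.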
Note: make sure the hosts entry in /etc/nsswitch.conf is set to:
hosts:      files nis dns
nsswitch.conf is the Name Service Switch configuration file. It tells Linux which sources to consult, and in what order, for common configuration databases and name resolution.

Split Brain Issues

One of the most dangerous situations in a cluster is for both nodes to become active at the same time. This is especially dangerous for clusters that share storage resources: both nodes could be writing to the data on shared storage, which will quickly cause data corruption.
When both nodes become active it is called “split brain”. It can happen when a node stops receiving heartbeats from its partner node. Since the two nodes are no longer communicating, neither knows whether the problem lies with the other node or with itself.
For example, say that the passive node stops receiving heartbeats from the active node because the heartbeat network has failed. If the passive node then starts the cluster services, you have a split-brain situation.
Many clusters use a Quorum Disk to prevent this from happening. The Quorum Disk is a small shared disk that both nodes can access at the same time. Whichever node is currently the active node writes to the disk periodically (usually every couple of seconds) and the passive node checks the disk to make sure the active node is keeping it up to date.
When a node stops receiving heartbeats from its partner node it looks at the Quorum Disk to see if it has been updated. If the other node is still updating the Quorum Disk then the passive node knows that the active node is still alive and does not start the cluster services.
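
The passive node's decision can be sketched as follows. This is a simplified model of the idea, not how any particular cluster stack implements it: the function name and the stale_after threshold are hypothetical, and timestamps are plain seconds.

```python
def should_start_services(heartbeat_ok, last_disk_update, now, stale_after=10.0):
    """Decide whether the passive node should start the cluster services.

    heartbeat_ok     -- True if heartbeats from the active node still arrive
    last_disk_update -- time (seconds) the active node last wrote the Quorum Disk
    now              -- current time (seconds)
    stale_after      -- hypothetical threshold after which the disk counts as stale
    """
    if heartbeat_ok:
        # The partner is reachable; nothing to do.
        return False
    # Heartbeats are lost: the Quorum Disk is the tie-breaker.
    # A fresh update means the active node is still alive.
    return (now - last_disk_update) > stale_after


# Heartbeat network is down, but the disk was updated 2 seconds ago.
take_over = should_start_services(False, last_disk_update=98.0, now=100.0)
```

With a recent disk update the passive node stays passive even though heartbeats are lost. It takes over only once the disk has gone stale as well.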
Red Hat clusters support Quorum Disks, but Red Hat support recommended against using one, since they are difficult to configure and can become problematic. Instead, they recommend relying on Fencing to prevent split brain.