Easy IP failover (Debian squeeze)

Say you have a bunch of servers on a fast private network, running a busy web site, and you need a gateway server so you can reach them from the outside world. You might run a proxy on that gateway server to expose the site. However, the site becomes unreachable if that server goes down.

To fix that you might add a second, identical backup server. Let's call them fw1 and fw2.

        internal servers
 ------------------------------- <-- private network (10.0.0.x)
    ^                      ^
    |                      |
  [fw1]                  [fw2]   <-- (physically) separate servers
    |                      |
 ---^----------------------^---- <-- public network
      the big wide world

You could use DNS failover to make the site respond from your second server (fw2) if the main one (fw1) goes down. That is an inexpensive solution and works quite well, but it can be slow to fail over in some cases, and failing back can be hard to get right. So there is a risk that visitors to your site may lose access for a short time anyway.

The solution here is to implement ‘proper’ cluster fail-over. Enter the Pacemaker suite. It can be integrated with many readily available tools such as Heartbeat, Corosync, and OpenAIS, and offers a wide range of features through a consistent CLI. It is available in most popular distributions under the GNU GPL v2 license.

Prepping each server

Let's start with just one server or node. It is nice to begin from a clean install with this type of setup, since that reduces complexity. By keeping things simple and consistent in our setup, it should be easier to manage and faster to replicate.

Tip: For our purposes a ‘node’ is a single member of a cluster. That can be any type of server, physical or virtual, that supports the toolset we will use below.

Disable unneeded services. I like to also install a few handy tools…

for service in apache2 postfix dovecot; do
  update-rc.d $service disable
  service $service stop
done
apt-get -y install vim sysv-rc-conf

Make sure the file-system has a unique UUID, especially if you are using VPS images like I did while testing. This can help Pacemaker (and associated tools) uniquely identify each server.

apt-get -y install uuid

echo "Original UUID is $(blkid /dev/xvda1)"
tune2fs -U "$(uuid -1)" /dev/xvda1; sleep 5;
echo "New UUID is $(blkid /dev/xvda1)"

If you use UUIDs in /etc/fstab, you will also need to update them to match the new UUID so file-systems mount correctly on boot. The ‘sleep’ forces a small pause, since the scriptlet finishes so fast that the new UUID is not always written to disk yet.
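
The fstab correction can be scripted too. Here is a minimal sketch, not part of the original setup: `replace_fstab_uuid` is a hypothetical helper name, and it assumes you captured the old UUID before running tune2fs (for example with `blkid -o value -s UUID /dev/xvda1`).

```shell
# Replace one filesystem UUID with another in an fstab-style file.
# Usage: replace_fstab_uuid OLD_UUID NEW_UUID [FILE]
# FILE defaults to /etc/fstab; the original is kept as FILE.bak.
replace_fstab_uuid() {
    old_uuid=$1
    new_uuid=$2
    fstab=${3:-/etc/fstab}
    # Keep a backup, then rewrite the file from that backup.
    cp "$fstab" "$fstab.bak" || return 1
    sed "s/UUID=$old_uuid/UUID=$new_uuid/g" "$fstab.bak" > "$fstab"
}
```

Try it against a copy of /etc/fstab first, and compare the result with the `blkid` output before rebooting.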

Next, set hostnames to be short, both as a convenience (sysadmins like shorter commands!) and to make things more predictable later in our network setup (e.g. instead of ‘ping fw1.example.com’, we only need ‘ping fw1’).

sed -i.gres 's/\.[a-z].*//g' /etc/hostname
/etc/init.d/hostname.sh
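
If you want to preview what that sed expression will do before it edits /etc/hostname, a small sketch like this may help. `short_hostname` is a hypothetical helper name; it only prints the result and changes nothing.

```shell
# Print the short form of a hostname, using the same sed expression
# as the command above: everything from the first ".<letter>" onwards
# is removed.
short_hostname() {
    printf '%s\n' "$1" | sed 's/\.[a-z].*//g'
}
```

For example, `short_hostname "$(cat /etc/hostname)"` shows the name the node would end up with.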

Now you can add details of each host in your cluster to /etc/hosts. If you run pdns-recursor, it can pick these entries up, so you do not have to repeat them on all your hosts. There are several ways to manage this across separate hosts, but we will do it the simple way here for now. For example, it might look a bit like this…

root@fw1:~# cat /etc/hosts
127.0.0.1 localhost
10.0.0.11 fw1.example.com fw1
10.0.0.12 fw2.example.com fw2
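
If you script this step, it helps to make it safe to run more than once. The following is only a sketch under that assumption: `add_host_entry` is a hypothetical helper that appends an entry unless the IP address is already listed.

```shell
# Append "IP FQDN SHORTNAME" to a hosts file unless the IP is already there.
# Usage: add_host_entry IP FQDN SHORTNAME [FILE]   (FILE defaults to /etc/hosts)
add_host_entry() {
    ip=$1
    fqdn=$2
    short=$3
    hosts=${4:-/etc/hosts}
    # Compare the first field exactly, so 10.0.0.1 does not match 10.0.0.11.
    if awk -v ip="$ip" '$1 == ip { found = 1 } END { exit !found }' "$hosts"; then
        return 0
    fi
    printf '%s %s %s\n' "$ip" "$fqdn" "$short" >> "$hosts"
}
```

For the example cluster that would be `add_host_entry 10.0.0.11 fw1.example.com fw1` and `add_host_entry 10.0.0.12 fw2.example.com fw2` on each node.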

At this stage the list of open network ports should look something like the following. As you can see, ssh is the only service listening to the outside world.

root@fw1:~# netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address  Foreign Address State    PID/Program name
tcp        0      0 0.0.0.0:22     0.0.0.0:*       LISTEN   1729/sshd
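
With more than a couple of nodes it can be handy to check this automatically. Below is a rough sketch, assuming the column layout shown above; `unexpected_listeners` is a hypothetical helper that reads netstat output and prints any listener on a port you did not expect.

```shell
# Read `netstat -plnt` output on stdin and print every listening address
# whose port is not in the allowed list (default: 22 only).
# Usage: netstat -plnt | unexpected_listeners [PORT...]
unexpected_listeners() {
    allowed=" ${*:-22} "
    awk -v allowed="$allowed" '
        $6 == "LISTEN" {
            # The port is whatever follows the last ":" in the local address.
            n = split($4, part, ":")
            if (index(allowed, " " part[n] " ") == 0) print $4
        }'
}
```

No output means nothing unexpected is listening. Pass extra ports as arguments once you start adding services, e.g. `netstat -plnt | unexpected_listeners 22 80`.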

Do the same on any other servers you plan to have in the cluster. This is then a good time to reboot each node and make sure it comes up running the way you expect.

Setting up the Pacemaker tools

This guide is largely based on the excellent documentation “Clusters from Scratch” (pdf).

On each node install all the software we will need for failover to work.

apt-get install -y pacemaker

That will pull in the Corosync packages as a standard dependency. You could install Heartbeat and use that instead if you prefer. In my tests Corosync seemed very responsive and more scalable. Let's configure that now.

cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf

export ais_port=5405            # debian default value
export ais_mcast=226.94.1.1     # debian default value (multicast address)
export ais_addr=10.0.0.0        # monitor private network interface only

sed -i.gres "s/.*mcastaddr:.*/mcastaddr:\ $ais_mcast/g" /etc/corosync/corosync.conf
sed -i.gres "s/.*mcastport:.*/mcastport:\ $ais_port/g" /etc/corosync/corosync.conf
sed -i.gres "s/.*bindnetaddr:.*/bindnetaddr:\ $ais_addr/g" /etc/corosync/corosync.conf

# log configuration fixup
sed -i.gres 's/.*logfile:.*\/tmp/logfile:\ \/var\/log\/corosync/' /etc/corosync/corosync.conf

# protocol version fixup
sed -i.gres 's/compatibility.*/compatibility:\ none/g' /etc/corosync/corosync.conf

# tell corosync to run as root, required to manage IP interfaces
cat <<-END >>/etc/corosync/corosync.conf
aisexec {
user: root
group: root
}
END

# tell Corosync to start Pacemaker services also
cat <<-END >>/etc/corosync/corosync.conf
service {
name: pacemaker
ver: 0
}
END
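
Since several sed edits just touched the file, it may be worth confirming what actually ended up in it. A minimal sketch, assuming the `key: value` layout of corosync.conf; `corosync_value` is a hypothetical helper name.

```shell
# Print the value(s) set for KEY in a corosync.conf-style file.
# Usage: corosync_value KEY [FILE]   (FILE defaults to /etc/corosync/corosync.conf)
corosync_value() {
    key=$1
    conf=${2:-/etc/corosync/corosync.conf}
    # Lines starting with "#" do not match, so commented-out settings are ignored.
    sed -n "s/^[[:space:]]*$key:[[:space:]]*//p" "$conf"
}
```

For example, `for k in bindnetaddr mcastaddr mcastport; do echo "$k = $(corosync_value $k)"; done` should echo back the three values exported earlier.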

The observant among you will see we have only set this up for monitoring on the private network interface. This is so we can play with it without risk of affecting others on the public network. If you need to do this with a public interface, talk to us as we can help you make sure that will work reliably on our network.

Copy the resulting /etc/corosync/corosync.conf to each server you want to have in the front end (or repeat the above steps there), and then do the following on each…

sed -i.gres 's/START=no/START=yes/' /etc/default/corosync
service corosync start

It's alive..!

You now have a running cluster! It is not doing much yet, but you should be able to use the Pacemaker tools to see what is occurring on your cluster. Something like the following…

root@fw1:~# crm status
============
Last updated: Fri Jun 10 05:58:01 2011
Stack: openais
Current DC: fw1 - partition WITHOUT quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ fw2 fw1 ]

Note that you can now monitor nodes, enable alerts, and assign/manage cluster resources using commands from any single active server in the cluster. There is normally no need to log in to multiple servers each time you want to make a change or run status checks on the cluster.
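
As a small illustration of scripting against that output, here is a sketch that pulls the node names out of the ‘Online:’ line. It assumes the output format shown above (Pacemaker 1.0.9), which may differ in other versions; `online_nodes` is a hypothetical helper name.

```shell
# Read `crm status` output on stdin and print the online node names,
# one per line.
# Usage: crm status | online_nodes
online_nodes() {
    sed -n 's/^Online: \[ \(.*\) ]$/\1/p' | tr ' ' '\n'
}
```

Something like `crm status | online_nodes | grep -c .` then gives a node count you could feed into a simple check.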

To add new servers to the cluster just rinse and repeat the above steps. It is that easy.

Tip: Corosync works best with an odd number of servers. Internally it uses a voting system for decisions. In most cases majority rule determines the right thing to do, which can break if you have a 50/50 split.
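
The arithmetic behind that tip is easy to sketch. A majority of N votes is floor(N/2) + 1; the function names below are made up for illustration and are not part of any Corosync or Pacemaker tool.

```shell
# Votes needed for a majority in a cluster of N nodes: floor(N/2) + 1.
votes_for_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while the survivors still hold a majority.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}
```

Running `tolerated_failures` for 2, 3, 4 and 5 nodes gives 0, 1, 1 and 2. So a fourth node tolerates no more failures than three nodes do, and an even split leaves neither half with a majority.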

Adding IP failover to the cluster, an example

Let's add some resources and tweak a few things. More information about these directives and what they do is available in the document I linked to earlier. From a cluster member, run the following on the command line…

crm configure property stonith-enabled=false
crm configure rsc_defaults resource-stickiness=100
crm configure primitive ClusterGateway ocf:heartbeat:IPaddr2 \
    params ip="10.0.0.1" cidr_netmask="24" \
    op monitor interval="10s"

And now you can see something like the following…

root@fw1:~# crm status
============
Last updated: Sat Jun 11 04:42:02 2011
Stack: openais
Current DC: fw1 - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ fw2 fw1 ]

 ClusterGateway (ocf::heartbeat:IPaddr2):       Started fw1

Right away you have a fail-over IP enabled. If you reboot the fw1 node, you should see that resource move to another node almost immediately. Note that it only runs on one server at a time, and that each resource you need has to be defined separately.
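
To watch the move from a script rather than by eye, a sketch like the following can report which node currently holds a resource. It assumes the status line format shown above; `resource_node` is a hypothetical helper name.

```shell
# Read `crm status` (or `crm_mon -1`) output on stdin and print the node
# that the named resource is started on.
# Usage: crm status | resource_node ClusterGateway
resource_node() {
    awk -v rsc="$1" 'NF >= 3 && $1 == rsc && $(NF - 1) == "Started" { print $NF }'
}
```

Run it before and after rebooting fw1 and the reported node should change. It prints nothing while the resource is stopped.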

It really is that easy…

What you can do now…

move resources between nodes (crm resource move ClusterGateway fw2)
try restarting the node running the resource and watch what happens to it (with ‘crm_mon’)
configure monitoring and alerts (use crm_mon in daemon mode)
configure a resource to manage an haproxy load balancer.
configure a resource to manage nginx (for SSL translation against the load balancer)
configure a resource to manage a dns recursion tool (to give your internal servers public dns)
add a dependency between the ClusterGateway resource and your other resources
much much more…

Stuff to think about…

Keep cluster nodes as simple as possible. They are faster to cycle/reboot if there is a problem with one, and to rebuild if needed. And you don’t require expensive hardware, even with an active firewall in place.
Keep services on separate servers so they won't be affected by any goings-on on other machines. It's easy to leverage our VPS plans to help with this approach.
Corosync requires a little bit of bandwidth to communicate with other servers. That is fine, but if you have lots of nodes be aware that the network they are on can get ‘noisy’ as a result. We can provision public and private network segments to help keep that traffic discrete and minimal.
When running a firewall, make sure to white-list the ports needed for Corosync. Those are listed in the documentation, and can be varied with configuration tweaks.
You can clone an image easily, and add that to your network quickly. Remember to update any disk UUIDs and other unique identifiers.
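
As a sketch of the firewall point above: Corosync talks UDP on the configured mcastport and on the port just below it (5405 and 5404 with the values used earlier). The helper below only prints a candidate iptables rule so you can review it first. `corosync_fw_rules` is a hypothetical name, and `eth1` as the private interface is an assumption to adjust for your own network.

```shell
# Print (not run) an iptables rule that accepts Corosync traffic
# on one interface, covering mcastport and mcastport - 1 over UDP.
# Usage: corosync_fw_rules [IFACE] [MCASTPORT]
corosync_fw_rules() {
    iface=${1:-eth1}    # assumed name of the private interface
    port=${2:-5405}     # mcastport as set in corosync.conf
    printf 'iptables -A INPUT -i %s -p udp -m multiport --dports %s,%s -j ACCEPT\n' \
        "$iface" "$(( port - 1 ))" "$port"
}
```

Check the printed rule against your existing firewall policy before applying it on each node.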

For those wondering what the difference is between the various parts of a Pacemaker stack, a good summary is available.

You can also check out the official docs for the latest version and the Pacemaker wiki for community updates.

There are also some other practical implementation examples available. Note that in general Heartbeat and Corosync can be used interchangeably, so pick the one that works for you…

http://blog.foaa.de/2010/10/intro-to-pacemaker-part-2-advanced-topics/
http://www.zivtech.com/blog/setting-ip-failover-heartbeat-and-pacemaker-ubuntu-lucid
https://wiki.ubuntu.com/ClusterStack/LucidTesting
http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha_draft/index.html