Testing an RHCS cluster
Test NIC Bonding
Redundant ExNet and NIC bonding are a requirement of how we build RHCS clusters (they reduce the chance of a cluster split). As the bonding may only have been added after the server QC was completed, it should be tested again here as part of the cluster testing. The commands below switch the active slave to each NIC in turn and check that the default gateway is still reachable:
# ifenslave -c bond0 eth1 && ping -c5 `route -n | awk '/^0.0.0.0/{print $2}'`
# ifenslave -c bond0 eth0 && ping -c5 `route -n | awk '/^0.0.0.0/{print $2}'`
# tail -25 /var/log/messages | grep bonding && cat /proc/net/bonding/bond*
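To confirm that the failover really happened, it can also be worth checking which slave the bond reports as active after each switch (bond0 and the ethX names above are assumed; adjust to match the build):
# grep 'Currently Active Slave' /proc/net/bonding/bond0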
Basic Relocation Between Nodes
You should first test that each service group relocates cleanly and will run happily on both nodes:
# clusvcadm -r mysql-svc -m <standby node>
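After each relocation, confirm that the service is reported as started on the node you expect before moving on; a minimal check with the standard status tool (mysql-svc is the example service name used above):
# clustat | grep mysql-svc
Then relocate the service back to the original node and check again.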
Breaking the Active Node
The biggest test is to break networking on the active node for a particular service. This tests:
Heartbeat and quorum
Fencing. Is the remaining node able to successfully fence the down node?
Service recovery. Does the service restart successfully on the remaining node?
Node rejoining the cluster after fencing
First, tail the syslog on the inactive node:
# tail -fn0 /var/log/messages
Then, once you are ready, break networking on the active node. Note that it is a good idea to force a sync before doing this:
# sync && ifdown bond0
If all goes well, the second node will detect that the first has died, fence it, and take over the downed service. The fenced node will then rejoin the cluster once it has finished rebooting. This needs to be repeated for each node in the cluster.
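Once the fenced node is back, it is also worth confirming that the cluster is quorate again and that both members are shown as online; a quick check, assuming a cman-based RHCS cluster as described here:
# cman_tool status | grep -i quorum
# cman_tool nodes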
Unmount file systems
The next few tests concern the cluster's ability to detect partial-fault conditions in service groups, i.e. when a particular resource is no longer available.
Depending on the recovery policy that was selected, the service will either restart on the same node or relocate to the other node.
Run the following to check the policy:
# grep recovery /etc/cluster/cluster.conf
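The lines returned come from the <service> definitions in cluster.conf. Purely as an illustration (your service names, failover domains and other attributes will differ), a relocate policy looks something like:
<service autostart="1" domain="mysql-domain" name="mysql-svc" recovery="relocate">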
First, force a lazy unmount of the filesystems for each service in turn. (This only has to be tested on one node.)
Note that for MySQL, this will fail if the my.cnf referenced by the mysql resource is on the SAN disk, because rgmanager checks whether the file is accessible. This is why the resource should point to /etc/my.cnf which, itself, includes the my.cnf on the SAN:
# umount -fl /san/mysql-fs && tail -fn0 /var/log/messages
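Once rgmanager has reacted, confirm that the service is running again and that the filesystem has actually been remounted (the mount point is the example used above):
# clustat | grep mysql-svc
# mount | grep /san/mysql-fs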
Remove IPs
Similarly, the cluster should detect when the cluster floating IP is no longer present (replace 192.168.100.100 with the cluster floating IP):
# ip addr del 192.168.100.100/24 dev bond0 && tail -fn0 /var/log/messages
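After recovery, check that the floating IP is present again on whichever node now runs the service (substitute your own address and interface):
# ip addr show dev bond0 | grep 192.168.100.100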
Stop/Kill Daemons
This last test is a little less clear-cut. What we are after is proof that if the main daemon for a service group returns a bad status check, the service group will be torn down and restarted correctly.
For MySQL services, we can send a SIGKILL to mysqld_safe and mysqld at the same time. Note that if you kill only mysqld, mysqld_safe will automatically restart it without any cluster intervention; and if you send them a SIGQUIT/SIGTERM, they have a habit of cleaning up their PID files on exit, which causes problems with the ‘stop’ procedure of some initscripts and of the RHCS MySQL resource type script:
# pkill -9 mysqld && tail -fn0 /var/log/messages
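Once the service has been recovered, a quick way to confirm that both mysqld_safe and mysqld are back and that rgmanager considers the service started (process and service names as above):
# pgrep -l mysqld && clustat | grep mysql-svc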
For NFS, stopping the NFS services (nfsd, lockd, etc.) should cause RHCS to detect a problem (This test does NOT work with RHEL 6):
# service nfs stop && tail -fn0 /var/log/messages
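While rgmanager reacts, and again once the service has been recovered, you can check which RPC services are registered; nfs, mountd and nlockmgr should all reappear after recovery:
# rpcinfo -p | egrep 'nfs|mountd|nlockmgr'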
For other types of services, you may have to use your imagination and/or common sense. Most services based around an initscript can be tested by manually calling a ‘stop’ of that initscript, as in the sketch below.
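For example, for a hypothetical httpd-based service group (the initscript name here is purely illustrative), the same pattern applies:
# service httpd stop && tail -fn0 /var/log/messages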
fsck All File Systems
Once you are done testing, you should fsck all file systems (both the SAN volumes for the clustered services and all system volumes).
Remember that between the fencing tests and any hiccups in the build process, these file systems may have been uncleanly unmounted several times.
For every cluster service:
# clusvcadm -d mysql-svc
# fsck -f /dev/vgsan00/mysql00
# clusvcadm -e mysql-svc
For system volumes, on each node in turn:
# touch /forcefsck; shutdown -r now
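Once the node is back up, you can confirm the check actually ran; for ext2/ext3 volumes the superblock records when the filesystem was last checked (the device name below is illustrative, substitute your own system volumes):
# tune2fs -l /dev/vg00/root00 | grep -i 'last checked'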
Clean Up Any Core Dumps
# rm /root/core.*
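Core files left over from the daemon-kill tests may not all land in /root; a quick sweep of the root filesystem can help catch strays before signing the cluster off (adjust the pattern as needed):
# find / -xdev -name 'core.*' 2>/dev/null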
Clear Bash History
# rm -f /root/.mysql_history /root/.bash_history; history -c