It is very common in our environment for the SQL guys (we call them DBAs) to request a new SAN drive to be added to their SQL clusters to expand their databases.
This happened on a normal weekend: my colleague created the SAN drive, set up the cluster resource for that disk and handed it over to the DBA to configure. Come Sunday night, there was an automatic reboot of the server. However, when it came back up, the new disk was corrupted! We could see folders but not files on the cluster disk. As a result, my colleagues had to work from Sunday into Monday to get the disk fixed. The disk was recreated and reformatted, and the DBA restored the database files on Monday.
Now all eyes were on this colleague of mine who added the new SAN disk. Did he make a mistake?
Now the standard process for adding a shared disk on MSCS clusters is as follows:
- Shut down one node (this allows the other node to have full ownership of the disk)
- Format the shared disk and set up the drive letter as needed
- Use cluster admin to add the new disk as a resource and set it up into a cluster group
- You may want to reboot (Windows 2003 is more stable, but you never know) to ensure that the new disk works well after a reboot
- Get your node up and power up the other node that was shut down
- Failover and test access to the new disk
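For those who prefer the command line over Cluster Administrator, the resource and failover steps above can be sketched with the cluster.exe tool that ships with Windows Server 2003. The disk, group and node names here are hypothetical, and the sketch omits details such as disk signature properties:

```bat
:: Hypothetical names: "Disk Q:" for the new physical disk resource,
:: "SQL Group" for the cluster group, NODE2 for the node that was shut down.

:: With NODE2 shut down and the disk formatted as Q: on NODE1,
:: create the physical disk resource directly in the right group:
cluster res "Disk Q:" /create /group:"SQL Group" /type:"Physical Disk"

:: Bring the new resource online:
cluster res "Disk Q:" /online

:: After NODE2 is powered back up, fail the group over to test access:
cluster group "SQL Group" /moveto:NODE2
```

Note that creating the resource in the correct group from the start avoids ever having to delete and recreate it later.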
In general this is how we create cluster disks, and luckily this was exactly what my colleague did. His initial reboot did not bring up a corrupted drive, and there was no good reason why that drive would become corrupted upon a reboot.
So we raised the issue with Microsoft, but they could not find anything, except that somehow they detected the new disk had been deleted from the cluster a few hours later (before the reboot, anyway).
After a few queries to our DBAs, we found out what really happened. The configured cluster was handed over to one of the DBAs to set up their database. This DBA was quite new to MSCS, and when he found that the newly created disk was not in the correct cluster group, instead of just moving it to the correct cluster group, he deleted the cluster disk resource and recreated it… all while both nodes were up and running!
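For reference, moving a resource between groups is a one-liner with cluster.exe (the resource and group names here are hypothetical), and it never releases the disk from cluster control:

```bat
:: Move the existing physical disk resource into the right group
:: without deleting it -- the disk stays managed by the cluster throughout.
cluster res "Disk Q:" /moveto:"SQL Group"
```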
This must have caused contention over ownership of the disk when it was “unmanaged” from the cluster. It was a free-for-all, as both nodes would be trying to own the disk for themselves. Despite there being no additional [serious] errors upon recreating the disk in the cluster, a reboot of the cluster caused this latent problem to manifest itself as a corrupt SAN disk.