Salt * Wet * Bytes

June 30, 2008

MSCluster: Low free PTEs caused cluster service to disconnect

Filed under: WinCluster, Windebug — saltwetfish @ 8:04 am

Back in the days when many of us were not sure what the 3GB switch really does, and thought it had to be set so that Windows could recognise 4GB of RAM and above, a number of our application servers were configured with the 3GB switch. This is one of those servers.

The other day one of our application clusters suddenly failed over. A quick check of the servers’ event logs showed no issues with low non-paged pool memory, no other memory issues and no network issues. The application log was rather clean, other than a strange repetitive event from the MOM agent. We have a keep-alive event which the MOM agent raises once a day, but I was seeing the same event twice every minute. On another server which had the same program, we could see that the idling node had its CPU running at 20% or more. Once we stopped the MOM agent, the CPU dropped to almost idle. (more…)

February 4, 2008

MSCluster: A rookie mistake dealing with shared disks

Filed under: WinCluster, Windebug — saltwetfish @ 3:48 am

It is very common in our environment for the SQL guys (we call them DBAs) to request a new SAN drive to be added to their SQL clusters to expand their databases.

This happened on a normal weekend. My colleague created the SAN drive, set up the cluster resource for that disk and handed it over to the DBA to configure. Come Sunday night, the server went through its automatic reboot. However, when it came up, the new disk was corrupted! We could see folders but no files in the cluster. As a result, my colleagues had to work from Sunday into Monday to get the disk fixed. The disk was recreated and reformatted, and the DBA restored the database files on Monday.

Now all eyes were on my colleague who had added the new SAN disk. Did he make a mistake?

Now the standard process for adding a shared disk on MSClusters is as follows:

  • Shut down one node (this allows the other node to have full ownership of the disk)
  • Format the shared disk and setup the drive letter as needed
  • Use cluster admin to add the new disk as a resource and set it up into a cluster group
  • You may want to reboot (Windows 2003 is more stable, but you never know) to ensure that the new disk works well after a reboot
  • Bring your node back up and power up the other node that was shut down
  • Failover and test access to the new disk
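The last step, testing access after a failover, can be as simple as writing a small file to the new disk and reading it back. Below is a minimal sketch in Python, not something from our actual process: the function name `verify_disk_access` and the probe file name are made up for illustration. It only shows that the node which currently owns the disk can create, list, read and delete a file. It says nothing about whether the file system is healthy underneath.

```python
import os
import shutil
import tempfile
import uuid


def verify_disk_access(root: str) -> bool:
    """Write, list, read back and delete a small probe file under root.

    Hypothetical helper for illustration only. Returns True when every
    step succeeds, False on any I/O error or content mismatch.
    """
    payload = uuid.uuid4().hex.encode("ascii")
    probe_dir = None
    try:
        # Temporary probe folder created inside the disk under test.
        probe_dir = tempfile.mkdtemp(prefix="disk_probe_", dir=root)
        probe_file = os.path.join(probe_dir, "probe.bin")
        with open(probe_file, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to flush the data to disk
        # The file itself must appear in a listing, not just its folder.
        listed = "probe.bin" in os.listdir(probe_dir)
        with open(probe_file, "rb") as fh:
            intact = fh.read() == payload
        return listed and intact
    except OSError:
        return False
    finally:
        # Clean up the probe folder whether or not the check passed.
        if probe_dir is not None:
            shutil.rmtree(probe_dir, ignore_errors=True)
```

One would run something like `verify_disk_access("S:\\")` on the node that owns the disk, fail over, and run it again on the other node (the drive letter here is only an example).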

In general this is how we create cluster disks, and luckily this was exactly what my colleague did. His initial reboot did not bring up a corrupted drive, and there was no good reason why that drive would become corrupted upon a later reboot. (more…)
