
Tag Archives: mscluster

Identifying Stale Cluster Computer Objects

From: http://blogs.msdn.com/b/clustering/archive/2011/08/17/10197069.aspx

In summary:

  • The pwdLastSet property of the CNO and VCOs in AD should be refreshed every 30 days by default (just like workstation passwords). If the password was last set much more than 30 days ago, the cluster object is probably stale.
  • By default, when a cluster Network Name resource is deleted or a cluster is destroyed, the CNO and VCOs are placed in a disabled state. Any cluster computer object in a Disabled state is no longer being used by the cluster.
  • When destroying a cluster, use the -CleanupAD switch of the Remove-Cluster PowerShell cmdlet to remove the CNO and VCOs instead of leaving them in a disabled state as above.
  • The CNO and VCOs carry the SPN “MSClusterVirtualServer”, so you can identify which computer objects belong to a cluster by querying for that SPN.
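As a rough illustration of the checks above, here is a minimal Python sketch. It assumes you have already read the pwdLastSet and userAccountControl attributes of a computer object by some means (the LDAP query itself is not shown). The function names, the 60-day threshold standing in for "much more than 30 days", and the example filter string are my own assumptions, not something from the original MSDN post.

```python
from datetime import datetime, timedelta, timezone

# pwdLastSet is stored as a Windows FILETIME: 100-nanosecond ticks since 1601-01-01 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

# ACCOUNTDISABLE bit of the userAccountControl attribute.
ACCOUNTDISABLE = 0x0002

# Example LDAP filter (an assumption for illustration) matching computer
# objects that carry an MSClusterVirtualServer SPN.
CLUSTER_SPN_FILTER = (
    "(&(objectCategory=computer)"
    "(servicePrincipalName=MSClusterVirtualServer/*))"
)


def filetime_to_datetime(filetime):
    """Convert a FILETIME tick count to a timezone-aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def is_disabled(user_account_control):
    """True if the ACCOUNTDISABLE bit is set."""
    return bool(user_account_control & ACCOUNTDISABLE)


def looks_stale(pwd_last_set, user_account_control, now=None, max_age_days=60):
    """Flag an object that is disabled or whose password is older than max_age_days.

    max_age_days=60 is a hypothetical threshold; pick whatever
    "much longer than 30 days" means in your environment.
    """
    now = now or datetime.now(timezone.utc)
    age = now - filetime_to_datetime(pwd_last_set)
    return is_disabled(user_account_control) or age > timedelta(days=max_age_days)
```

A flagged object is only a candidate for cleanup; it is worth confirming with the cluster owner before deleting anything from AD.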

 

 

 

Posted on August 18, 2011 in Windows

Network corruption causing MS Clusters IP resource to fail

Recently, we had a Windows 2003 MS cluster in which the cluster IP address failed to start. When we looked at the network resources, we found that the public network was marked as “unavailable”. This was strange because the public IP was clearly working: we were accessing the servers over RDP through it. Because the network resource was unavailable to the cluster, none of the IP resources could start, since they all depend on the public network being available.

 
Posted on June 4, 2010 in Windows

MSCluster: A rookie mistake dealing with shared disks

It is very common in our environment for the SQL guys (we call them DBAs) to request that a new SAN drive be added to their SQL clusters to expand their databases.

This happened on a normal weekend: my colleague created the SAN drive, set up the cluster resource for that disk and handed it over to the DBA to configure. On Sunday night the server went through its scheduled automatic reboot. However, when it came back up, the new disk was corrupted! We could see folders but no files on the disk. As a result, my colleagues had to work from Sunday through Monday to get the disk fixed. The disk was recreated and reformatted, and the DBA restored the database files on Monday.

Now all eyes were on the colleague who had added the new SAN disk. Did he make a mistake?

The standard process for adding a shared disk to an MS cluster is as follows:

  • Shut down one node (this gives the other node full ownership of the disk)
  • Format the shared disk and assign the drive letter as needed
  • Use Cluster Administrator to add the new disk as a resource and place it in a cluster group
  • You may want to reboot (Windows 2003 is more stable, but you never know) to ensure that the new disk still works after a reboot
  • Once this node is back up, power on the node that was shut down
  • Fail over and test access to the new disk

In general this is how we create cluster disks, and luckily this was exactly what my colleague did. His initial reboot did not bring up a corrupted drive, and there was no good reason why the drive would become corrupted after a later reboot.

 
Posted on February 4, 2008 in Windows

Windows cluster service did not start

Problem:

The problem started when I was configuring a new printer and looking at the configuration of a TCP/IP port via one of the virtual nodes (let’s call it 01A), and it hung. I tried to kill the spooler process and restart it, but it stayed in “Start pending” in Cluster Administrator. So I force-killed the spooler service, failed the resources over to node 2 and rebooted node 1.

When node 1 came up again, the cluster service was stuck in “starting” and would not join the cluster.

When I looked at the cluster.log file, I could see that the cluster service had tried to load its cluster hive unsuccessfully. The following sequence was repeated in the log:

00000a9c.00000ad0::2006/10/10-11:51:59.500 [DM] DmJoin: getting new registry database
00000a9c.00000ad0::2006/10/10-11:52:07.343 [DM] Obtained new database.
00000a9c.00000ad0::2006/10/10-11:52:07.343 [DM] DmpRestartFlusher: Entry
00000a9c.00000af8::2006/10/10-11:52:07.343 [DM] DmpRegistryFlusher: restarting
00000a9c.00000ad0::2006/10/10-11:52:07.500 [DM] DmpSafeDatabaseCopy:: SetFileAttrib on BkpPath C:\WINNT\cluster\CLUSDB.BKP$ failed, Status=2
00000a9c.00000ad0::2006/10/10-11:52:15.453 [DM]: Loading cluster database from C:\WINNT\cluster\CLUSDB
00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] GumEndJoinUpdate: attempting update type 1 context 4100 sequence 6922482
00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] Join attempt for type 1 failed 5083
00000a9c.00000ad0::2006/10/10-11:52:24.640 [DM] GumEndJoinUpdate with sequence 6922482 failed with a sequence mismatch
00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] GumpNodeCallback setting node 2 active.
00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] GumBeginJoinUpdate succeeded with sequence 6922624
00000a9c.00000ad0::2006/10/10-11:52:24.640 [DM] DmJoin: getting new registry database
00000a9c.00000ad0::2006/10/10-11:52:32.046 [DM] Obtained new database.
00000a9c.00000ad0::2006/10/10-11:52:32.046 [DM] DmpRestartFlusher: Entry

I tried searching the internet for answers but found no direct resolution. I did see that a few people had hit similar events when printer drivers or printer registry entries caused the problem. That sounded related enough, because this is after all a print server and I was configuring a new printer when it happened.

From MS KB:

————————————————————————————————-
00000c50.00000be8::2006/10/10-11:32:08.078 [GUM] Join attempt for type 1 failed 5083
————————————————————————————————-
Explanation:
Error code 5083 means “ERROR_CLUSTER_DATABASE_SEQMISMATCH”. The join operation failed because the cluster database sequence number has changed or is incompatible with the locker node.

This tells me that the cluster hives on node 1 and node 2 are out of sync, and hence node 1 could not start its cluster service and join.
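When the log is long, it can help to pull out just the failed join attempts. The following is a minimal Python sketch that assumes the cluster.log line layout seen in the excerpt above (process.thread IDs, a timestamp, a bracketed component, then the message); the function names and the regular expressions are my own and are not part of any Microsoft tool.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above:
#   <pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm [COMPONENT] message
# Some lines have a colon after the bracket, e.g. "[DM]: Loading ...".
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"\[(?P<component>\w+)\]:?\s*(?P<message>.*)$"
)
JOIN_FAIL_RE = re.compile(r"Join attempt for type (\d+) failed (\d+)")

# 5083 = ERROR_CLUSTER_DATABASE_SEQMISMATCH, as quoted from the KB above.
SEQ_MISMATCH = 5083


def parse_line(line):
    """Split one cluster.log line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": datetime.strptime(m.group("ts"), "%Y/%m/%d-%H:%M:%S.%f"),
        "component": m.group("component"),
        "message": m.group("message"),
    }


def find_join_failures(lines):
    """Return (timestamp, status_code) for every failed join attempt found."""
    failures = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        m = JOIN_FAIL_RE.search(entry["message"])
        if m:
            failures.append((entry["timestamp"], int(m.group(2))))
    return failures
```

Feeding it the lines of cluster.log and filtering the result for SEQ_MISMATCH shows how often, and when, the node hit the sequence mismatch.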

Resolution and steps taken:

The solutions that I found pointed to a possibly corrupted CLUSDB file (this is the cluster registry hive). So I tried to restore it from the *.tmp file in the quorum (http://support.microsoft.com/?id=224999), but that did not work.

I loaded node 1’s CLUSDB as a registry hive; it loaded properly and did not seem to be corrupted, but I was not convinced that CLUSDB was not the cause.

Lastly, I decided to stop the cluster service on node 2 and copy its CLUSDB to node 1. I loaded it up, restarted node 2, waited, and after a while restarted node 1, and it came up perfectly!

Diagnosis:

This is my diagnosis of what I think happened. When node 1’s spooler hung at the TCP/IP port configuration dialog and the node was rebooted, there could have been inconsistent values left in the cluster registry that prevented the cluster service from loading the hive properly, even though I could load the CLUSDB hive manually. I don’t think there was anything I could have done beforehand to prevent this problem.

Thus, copying the hive from node 2 to replace the corrupted one on node 1 resolved the issue.

 
Posted on June 17, 2007 in Windows

Setting up Windows clusters

One of the most commonly missed configuration items for a pair of cluster nodes is the boot-up timeout value in the boot.ini file.

Why is it important for cluster nodes to have different boot-up timings? Mainly to prevent the situation where both nodes boot at the same time (say, after a power failure) and both try to lock the quorum and other shared disks, resulting in a deadlock.

As a standard, I always set the boot-up timeout on the second node to 90 seconds.
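For illustration, here is a minimal Python sketch of changing that value by rewriting the timeout line in the [boot loader] section of boot.ini text. The function name and the sample boot.ini contents are hypothetical, and the sketch assumes the boot menu timeout actually applies on the node; it only transforms text and does not touch the real file or its read-only/system attributes.

```python
import re

# Illustrative boot.ini contents (an assumption, not taken from a real server).
SAMPLE_BOOT_INI = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect
"""


def set_boot_timeout(boot_ini_text, seconds):
    """Return boot.ini text with timeout=<seconds> in the [boot loader] section."""
    out = []
    in_boot_loader = False
    replaced = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [boot loader] without having seen a timeout line: add one.
            if in_boot_loader and not replaced:
                out.append(f"timeout={seconds}")
                replaced = True
            in_boot_loader = stripped.lower() == "[boot loader]"
        elif in_boot_loader and re.match(r"timeout\s*=", stripped, re.IGNORECASE):
            line = f"timeout={seconds}"
            replaced = True
        out.append(line)
    # [boot loader] was the last section and had no timeout line.
    if in_boot_loader and not replaced:
        out.append(f"timeout={seconds}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Second node waits 90 seconds, as in the text above.
    print(set_boot_timeout(SAMPLE_BOOT_INI, 90))
```

Only the second node would get the longer value; the first node keeps its existing timeout so that it comes up first.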

 
Posted on June 7, 2007 in Windows

Windows cluster node boot up timing

Why is the node boot up timing important in Windows clusters?

Most of the time this is not an issue, especially when you have one node up all the time. But what happens when there is a power trip, or someone shuts down both nodes and starts them up together? If you are lucky, there are no problems with your cluster, but chances are there will be resource contention as both nodes start at the same time and each tries to claim ownership of the shared storage and become the active node!

Thus, it’s good practice to delay the boot-up of your second node by, say, 60 or 120 seconds relative to the first node. This allows node 1 to always come up first and claim the resources.

 
Posted on October 11, 2006 in Windows

Windows cluster log size

The default cluster log size on Windows 2000/NT is 64 KB. However, if you have a lot of shares, this may cause some of the shares not to be listed. Microsoft recommends changing it to 4096 KB instead.

MSKB: http://support.microsoft.com/kb/225081/EN-US/

 
Posted on October 11, 2006 in Windows