The problem started when I configured a new printer. I was looking at the configuration of a TCP/IP port via one of the virtual nodes (let’s call it 01A) when the dialog hung. I killed the spooler process and restarted it, but it stayed in “Start pending” state in Cluster Administrator. So I force-killed the spooler service, failed the resources over to node 2, and rebooted node 1.
When node 1 came back up, the cluster service stayed in “starting” state and would not join the cluster.
When I looked at the cluster.log file, I could see that the cluster service had tried, unsuccessfully, to load its cluster hive. The following sequence was repeated in the log:
00000a9c.00000ad0::2006/10/10-11:51:59.500 [DM] DmJoin: getting new registry database
00000a9c.00000ad0::2006/10/10-11:52:07.343 [DM] Obtained new database.
00000a9c.00000ad0::2006/10/10-11:52:07.343 [DM] DmpRestartFlusher: Entry
00000a9c.00000af8::2006/10/10-11:52:07.343 [DM] DmpRegistryFlusher: restarting
00000a9c.00000ad0::2006/10/10-11:52:07.500 [DM] DmpSafeDatabaseCopy:: SetFileAttrib on BkpPath C:\WINNT\cluster\CLUSDB.BKP$ failed, Status=2
00000a9c.00000ad0::2006/10/10-11:52:15.453 [DM]: Loading cluster database from C:\WINNT\cluster\CLUSDB
00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] GumEndJoinUpdate: attempting update type 1 context 4100 sequence 6922482
00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] Join attempt for type 1 failed 5083
00000a9c.00000ad0::2006/10/10-11:52:24.640 [DM] GumEndJoinUpdate with sequence 6922482 failed with a sequence mismatch
00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] GumpNodeCallback setting node 2 active.
00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] GumBeginJoinUpdate succeeded with sequence 6922624
00000a9c.00000ad0::2006/10/10-11:52:24.640 [DM] DmJoin: getting new registry database
00000a9c.00000ad0::2006/10/10-11:52:32.046 [DM] Obtained new database.
00000a9c.00000ad0::2006/10/10-11:52:32.046 [DM] DmpRestartFlusher: Entry
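A loop like this is easier to spot with a small script than by eye. The following is a minimal sketch in Python, assuming the line layout seen in the excerpt above (process.thread IDs, timestamp, bracketed component, message). The function name `summarize_join_attempts` and the sample list are my own, not part of any cluster tooling.

```python
import re
from collections import Counter

# Line shape assumed from the excerpt:
#   <pid>.<tid>::<date>-<time> [<component>] <message>
# Some lines carry a colon after the tag ("[DM]: Loading ..."), so it is optional.
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<date>\d{4}/\d{2}/\d{2})-(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"\[(?P<component>[A-Z]+)\]:?\s*(?P<message>.*)$"
)
JOIN_FAIL_RE = re.compile(r"Join attempt for type (\d+) failed (\d+)")


def summarize_join_attempts(lines):
    """Return (number of join cycles started, {error code: failure count})."""
    join_starts = 0
    failures = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a log record
        message = match.group("message")
        # Each join cycle in the excerpt begins with this DmJoin message.
        if message.startswith("DmJoin: getting new registry database"):
            join_starts += 1
        failed = JOIN_FAIL_RE.search(message)
        if failed:
            failures[int(failed.group(2))] += 1
    return join_starts, dict(failures)


# Three lines copied from the excerpt above.
SAMPLE = [
    "00000a9c.00000ad0::2006/10/10-11:51:59.500 [DM] DmJoin: getting new registry database",
    "00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] Join attempt for type 1 failed 5083",
    "00000a9c.00000ad0::2006/10/10-11:52:24.640 [DM] DmJoin: getting new registry database",
]

if __name__ == "__main__":
    print(summarize_join_attempts(SAMPLE))
```

Run against the whole cluster.log (for example `summarize_join_attempts(open(path))`), a high join count paired with the same error code each time would point to the kind of retry loop described here.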
I searched the internet for answers but found no direct resolution. I did see that a few people seem to have hit similar events when printer drivers or printer registry entries caused the problem. That sounded related enough, because this is after all a print server and I was configuring a new printer when it happened.
From MS KB:
00000c50.00000be8::2006/10/10-11:32:08.078 [GUM] Join attempt for type 1 failed 5083
Error code 5083 means “ERROR_CLUSTER_DATABASE_SEQMISMATCH”. The join operation failed because the cluster database sequence number has changed or is incompatible with the locker node.
This tells me that the cluster hives on node 1 and node 2 were mismatched, and hence node 1 could not start its cluster service.
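The mismatch is visible in the log itself: the failed GumEndJoinUpdate carries one sequence number and the GumBeginJoinUpdate that follows carries a higher one. Here is a rough Python sketch that pairs the two and reports the difference. The function name `sequence_drift` is hypothetical, and reading the gap as "the database changed during the join" is my interpretation of the error text, not something the log states outright.

```python
import re

# Message fragments taken from the log excerpt earlier in this post.
END_FAIL_RE = re.compile(r"GumEndJoinUpdate with sequence (\d+) failed")
BEGIN_OK_RE = re.compile(r"GumBeginJoinUpdate succeeded with sequence (\d+)")


def sequence_drift(lines):
    """Pair each failed end-join sequence with the next begin-join sequence.

    Returns a list of (failed_sequence, next_sequence, difference) tuples.
    """
    pairs = []
    pending = None  # sequence number of the most recent failed end-join
    for line in lines:
        failed = END_FAIL_RE.search(line)
        if failed:
            pending = int(failed.group(1))
            continue
        began = BEGIN_OK_RE.search(line)
        if began and pending is not None:
            new_sequence = int(began.group(1))
            pairs.append((pending, new_sequence, new_sequence - pending))
            pending = None
    return pairs


# Two lines copied from the excerpt.
SAMPLE = [
    "00000a9c.00000ad0::2006/10/10-11:52:24.640 [DM] GumEndJoinUpdate with sequence 6922482 failed with a sequence mismatch",
    "00000a9c.00000ad0::2006/10/10-11:52:24.640 [GUM] GumBeginJoinUpdate succeeded with sequence 6922624",
]

if __name__ == "__main__":
    for failed_seq, next_seq, diff in sequence_drift(SAMPLE):
        print(f"failed at {failed_seq}, next attempt began at {next_seq} (+{diff})")
```

For the two lines in the excerpt, the sequence number moved from 6922482 to 6922624, a difference of 142, between the failed attempt and the next one.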
Resolution and steps taken:
The solutions that I found pointed to a possibly corrupted CLUSDB file (this is the cluster registry hive). So I tried to restore it from the *.tmp file in the quorum (http://support.microsoft.com/?id=224999), but that did not work.
I loaded node 1’s CLUSDB as a registry hive; it loaded properly and did not seem to be corrupted, but I was not convinced that CLUSDB was not the cause.
Lastly, I decided to stop the cluster service on node 2 and copy its CLUSDB to node 1. I loaded it up, restarted node 2, waited, and after a while restarted node 1, and it came up perfectly!
This is my diagnosis of what happened. When node 1’s spooler hung at the TCP/IP port configuration dialog and the node was rebooted, there could have been intermediate values left in the cluster registry that prevented the cluster service from loading the hive properly, even though I could load the CLUSDB hive manually. I don’t think there was anything I could have done beforehand to prevent this problem.
Thus, copying the hive from node 2 to replace the corrupted one on node 1 resolved the issue.