MSCluster: Low free PTEs caused cluster service to disconnect

30 Jun

Back in the days when many of us were not sure what the 3GB switch really did, and thought it had to be set for Windows to recognise 4GB of RAM or more, a number of our application servers were configured with the 3GB switch. This is one of those servers.

The other day one of our application clusters suddenly failed over. A quick check of the servers' event logs showed no low non-paged pool warnings, no other memory issues and no network issues. The application log was fairly clean too, apart from a strange repeating event from the MOM agent. We have a keep-alive event that the MOM agent runs once a day, but I was seeing the same event fire twice every minute. On another server running the same program, the idle node's CPU was running at 20% or more. Once we stopped the MOM agent, the CPU dropped to almost idle.

So I decided to check whether the MOM agent could have caused this issue. The first thing I found was that this cluster had the 3GB switch set in its boot.ini. Another cluster that had the same problem with our MOM agent did not disconnect; it only showed high CPU utilisation, and it did not have the 3GB switch.

This made me suspect that the failover was related to the 3GB switch, and that the MOM agent's behavior had made it worse. Note that the server was no longer failing over, since restarting the MOM agent had returned it to normal operation, so I did not have a misbehaving server to diagnose, only a working one that had recently failed.

In general, the 3GB switch should only be used if your application specifically needs the extra 1GB of user address space. I believe this is a requirement for some Exchange and MSSQL servers. Even then, on Windows 2003 it is recommended to use the 3GB switch together with the USERVA switch. On Windows 2000, the same effect is achieved through the SystemPages registry setting.

The problem with turning on the 3GB switch is that it shrinks the kernel address space (e.g. drivers, system pages, paged pool) to 1GB and gives 3GB to user mode, whereas the default split is 2GB/2GB. These kernel resources are what keep basic OS components such as networking, paging and clustering running correctly.
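As a rough illustration of the split, the Python sketch below reads the switches on a boot.ini OS entry and reports how the 4GB of 32-bit virtual address space is divided. It is a simplification under stated assumptions: the function name and the sample boot.ini lines are hypothetical, /USERVA is assumed to be given in MB and to apply only together with /3GB, and all other boot options are ignored.

```python
import re

TOTAL_VA_MB = 4096  # 32-bit virtual address space per process, in MB


def address_space_split(boot_line: str) -> tuple[int, int]:
    """Return (user_mb, kernel_mb) implied by the switches on a boot.ini OS entry.

    Simplifying assumptions: without /3GB the split is 2GB/2GB; /3GB alone
    gives 3GB to user mode; /3GB plus /USERVA=n gives n MB to user mode.
    /USERVA without /3GB is ignored.
    """
    switches = boot_line.upper()
    if not re.search(r"/3GB\b", switches):
        user_mb = 2048
    else:
        userva = re.search(r"/USERVA=(\d+)", switches)
        user_mb = int(userva.group(1)) if userva else 3072
    return user_mb, TOTAL_VA_MB - user_mb


# Hypothetical boot.ini entries, for illustration only.
base = r'multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect'
print(address_space_split(base))                         # (2048, 2048)
print(address_space_split(base + " /3GB"))               # (3072, 1024)
print(address_space_split(base + " /3GB /USERVA=3030"))  # (3030, 1066)
```

With /3GB /USERVA=3030 the kernel ends up with 1066MB instead of 1024MB, i.e. 42MB of address space handed back to the kernel compared with /3GB alone.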

One effect of the 3GB switch is that the number of free PTEs (Page Table Entries) drops from 106,000 to only 15,000 (on a Windows 2000 server with 4GB RAM). With so few free PTEs, the OS can easily become unstable when demand on it grows, whether gradually over time or in a sudden spike. A system that runs low on PTEs can stop drivers, such as disk or network drivers, from working, or run out of paged pool.
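To put those PTE counts in perspective: on x86 each system PTE maps one 4KB page, so the free PTE count translates directly into system address space still available for new mappings. A minimal Python sketch of that arithmetic, using the two figures quoted above (the function name is just for illustration):

```python
PAGE_SIZE_KB = 4  # x86 small page size; each system PTE maps one page


def ptes_to_mb(pte_count: int) -> float:
    """Approximate system address space (MB) covered by a number of PTEs."""
    return pte_count * PAGE_SIZE_KB / 1024


# Free PTE figures quoted above for Windows 2000 with 4GB RAM.
for label, ptes in (("without 3GB switch", 106_000), ("with 3GB switch", 15_000)):
    print(f"{label}: {ptes:,} free PTEs ~ {ptes_to_mb(ptes):.0f} MB")
# prints ~414 MB for the default split and ~59 MB with the 3GB switch
```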

From: Detection, Analysis, and Corrective Actions for Low Page Table Entry Issues

The PTE pool size is automatically determined at system startup based on the amount of physical memory in the system. This pool is squeezed in between paged pool and nonpaged pool, which also grows with the amount of physical memory in the system.

The system PTE pool can become heavily used and heavily fragmented. This could lead to situations where a driver might not load. Also, if the system PTE pool is depleted completely, other parts of the system will degrade, even resulting in threads not being created, system stalls, and potential system crashes.

In the server's cluster log I found the following entries, two of them errors, logged at the moment the cluster failed over:

0000145c.000012a8::2008/06/23-01:17:36.924 Network Name : Unable to read resource data parameter, error=2
0000145c.000012a8::2008/06/23-01:17:36.924 Network Name : Unable to read creating DC parameter, error=2

0000145c.000012a8::2008/06/23-01:17:36.924 Network Name : Successful open of resid 657744
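If you need to pull entries like these out of a large cluster log, a small parser helps. The sketch below assumes the line layout seen in the lines above (process and thread IDs in hex, a timestamp, the resource type, then the message) and that error= values are Win32 error codes; under that assumption error=2 is ERROR_FILE_NOT_FOUND. The function and field names are made up for illustration, and real cluster logs may contain lines in other formats, which this sketch simply skips.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed layout, inferred only from the lines above:
#   <pid hex>.<tid hex>::<YYYY/MM/DD-HH:MM:SS.mmm> <resource type> : <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<source>.+?) : (?P<message>.*)$"
)
ERROR_RE = re.compile(r"error=(\d+)")

# Assumes error= carries a Win32 error code; only code 2 is named here.
WIN32_ERRORS = {2: "ERROR_FILE_NOT_FOUND"}


def parse_cluster_line(line: str) -> Optional[dict]:
    """Split one cluster log line into fields; return None if it doesn't match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["pid"] = int(entry["pid"], 16)
    entry["tid"] = int(entry["tid"], 16)
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    err = ERROR_RE.search(entry["message"])
    entry["error"] = int(err.group(1)) if err else None
    entry["error_name"] = WIN32_ERRORS.get(entry["error"])
    return entry


log = """\
0000145c.000012a8::2008/06/23-01:17:36.924 Network Name : Unable to read resource data parameter, error=2
0000145c.000012a8::2008/06/23-01:17:36.924 Network Name : Unable to read creating DC parameter, error=2
0000145c.000012a8::2008/06/23-01:17:36.924 Network Name : Successful open of resid 657744"""

# Print only the entries that carry an error code.
for line in log.splitlines():
    entry = parse_cluster_line(line)
    if entry and entry["error"] is not None:
        print(entry["ts"], entry["source"], entry["error"], entry["error_name"])
```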

These entries point to a network resource on the cluster becoming unavailable, which is a symptom of low PTEs.

In its Exchange troubleshooting KB, Microsoft recommends keeping free PTEs above 8,000, with 5,000 as the warning threshold, and I assumed the same guidance applies to other servers as well. So I ran perfmon with the Memory\Free System Page Table Entries counter and found that free PTEs on the cluster averaged around 6,500: below the recommended 8,000, though still above the 5,000 warning threshold.
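The threshold check itself is simple enough to script once the Free System Page Table Entries samples have been exported from perfmon (for example, from a counter log). A minimal Python sketch, assuming the 8,000 and 5,000 thresholds quoted above and treating both as strict lower bounds; the function name and status labels are hypothetical.

```python
# Thresholds as quoted above from Microsoft's Exchange troubleshooting guidance.
RECOMMENDED_MIN = 8000  # free PTEs should stay above this
WARNING_LEVEL = 5000    # at or below this, treat as a warning


def assess_free_ptes(samples: list[int]) -> tuple[float, str]:
    """Average 'Free System Page Table Entries' samples and grade the result.

    Treating both thresholds as strict lower bounds is an assumption.
    """
    if not samples:
        raise ValueError("need at least one sample")
    avg = sum(samples) / len(samples)
    if avg > RECOMMENDED_MIN:
        status = "ok"
    elif avg > WARNING_LEVEL:
        status = "below recommendation"
    else:
        status = "warning"
    return avg, status


print(assess_free_ptes([6500]))  # (6500.0, 'below recommendation')
```

Fed the cluster's observed average of about 6,500, it lands in the middle band: not yet at the warning level, but with little headroom left for a misbehaving agent.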

So it's not hard to imagine that when our MOM agent went bonkers, the additional load on the server pushed free PTEs down to such a low level that the network resource started to bail out, causing the cluster to fail over.

I recommended that the application team remove the 3GB switch if they did not need it, which they did.


Posted by on June 30, 2008 in Windows
