Recently we found a server (HP ProLiant DL580 G2) rebooting itself almost every day because of BSODs. My first instinct was to get the server patched with all the latest patches. I especially suspected the SATA or SCSI drivers, as we have seen other Windows 2003 servers with similar storport driver issues (HP ProLiant servers running storport crashing), which were resolved by patching the HP drivers and updating W2K3 with the hotfixes.
This was also the original diagnosis and recommendation from Microsoft support:
1. The stop code is 0x000000C2, which indicates that the current thread is making a bad pool request:
1: kd> .bugcheck
Bugcheck code 000000C2
Arguments 00000007 0000121a 00000800 e9a15d90
2. The 4th parameter is the address of the pool block that got corrupted:
*e9a15d78 size: 288 previous size: 10 (Allocated) *Toke (Protected)
1: kd> dc e9a15d90
e9a15d90  00000000 00000000 00000000 00000001  ................
e9a15da0  00000000 00000000 bad0b0b0 82100000  ................
e9a15db0  00000000 00000000 61766441 20206970  ........Advapi
3. The previous pool block should reach e9a15d78+288=e9a16000:
1: kd> dc e9a16000
e9a16000  e8000000 fffffc84 c01bd8f7 00074992  .............I..
e9a16010  0054004e 0073005c 00730079 00650074  N.T.\.s.y.s.t.e.
e9a16020  0033006d 005c0032 006c0064 0063006c  m.3.2.\.d.l.l.c.
4. Searching memory found that the address could be referenced by a pool allocation tagged with $CPH, which should be owned by CPQPHP:
1: kd> !for_each_module s-a @#Base @#End "$CPH"
f34d8869 24 43 50 48 ff 74 24 08-6a 01 ff 15 48 01 4d f3 $CPH.t$.j…H.M.
1: kd> u f34d8869
*** ERROR: Module load completed but symbols could not be loaded for CPQPHP.SYS
f34d8869 2443 and al,43h
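The pool-header arithmetic the engineer used in step 3 is easy to verify: the corrupted block's header sits at e9a15d78 with a (hex) size of 288, so the next block should start exactly at e9a16000, which is why that address was dumped next. A quick sketch using the values from the output above:

```python
# Values taken from the pool output above: the corrupted block's
# header starts at e9a15d78 and the block is 0x288 bytes long.
header = 0xE9A15D78
size = 0x288

# The next pool block begins immediately after this one ends,
# which is why the engineer dumped memory at e9a16000 in step 3.
next_block = header + size
print(hex(next_block))  # 0xe9a16000
```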
I got in contact with the server owner and scheduled an upgrade, but as I was doing the upgrade, the server just kept crashing. This was a bit strange, as it was not running much at that point in time.
After much trouble, we decided to rebuild the entire server from scratch. Strangely enough, even during the rebuild process, the server was crashing with a BSOD code of 0x50. After we completed the rebuild and patched the server with all the latest hotfixes and drivers we could find, it continued to BSOD! Here is the snapshot from the HP IML within 3 days…
- Blue Screen Trap (BugCheck, STOP: 0x00000050 (0xE7971511, 0x00000000, 0xB7BE6B28, 0x00000002)) 1/14/2008 1:26PM 1/14/2008 6:08PM
- Blue Screen Trap (BugCheck, STOP: 0x00000050 (0xECDA418C, 0x00000001, 0xBF8AC995, 0x00000002)) 1/14/2008 11:18AM 1/14/2008 6:08PM
- ASR Detected by System ROM 1/13/2008 3:09AM 1/14/2008 10:34AM
- Blue Screen Trap (BugCheck, STOP: 0x0000000A (0x00000004, 0xD0000002, 0x00000001, 0x80A5A56E)) 1/11/2008 5:12PM 1/14/2008 6:08PM
- Blue Screen Trap (BugCheck, STOP: 0x00000050 (0xE52872DC, 0x00000000, 0xB990A12A, 0x00000002)) 1/11/2008 2:19PM 1/14/2008 6:08PM
- Blue Screen Trap (BugCheck, STOP: 0x00000020 (0x00000000, 0x00000001, 0x00000000, 0x00000001)) 1/11/2008 1:30PM 1/14/2008 6:08PM
Now at this point, I couldn't fault any drivers anymore, because this was now a vanilla build and no other servers with this type of build were BSODing like this.
Attention now turned to the main hardware components, the first suspect being the array controller.
We raised the same issue with Microsoft with the 0xA error and the other minidumps, and they came back with the following:
Checking the memory dump shows a problem which cannot be explained at the software level: there is a one-bit flip in data saved on the stack.
Usually this is not software-level behaviour, as it is unusual to see a driver or application try to modify only one bit of kernel data, especially on the stack. Data on the stack is private to the function being called, which means other threads will not be aware of what kind of data is saved there and are not able to touch it directly.
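A one-bit flip of the kind Microsoft described is easy to characterise: XOR the expected and observed words and check that exactly one bit is set. A minimal sketch (the sample values are hypothetical, not taken from this dump):

```python
def is_single_bit_flip(expected: int, observed: int) -> bool:
    """Return True if the two words differ in exactly one bit."""
    diff = expected ^ observed
    # A non-zero power of two has exactly one bit set, so
    # diff & (diff - 1) clears to zero only for single-bit differences.
    return diff != 0 and (diff & (diff - 1)) == 0

# Hypothetical example: the same 32-bit word with bit 4 flipped.
print(is_single_bit_flip(0x80A5A56E, 0x80A5A57E))  # True
print(is_single_bit_flip(0x80A5A56E, 0x80A5A55E))  # False (two bits differ)
```

Software bugs almost always clobber whole bytes or words; an isolated single-bit difference is the classic fingerprint of failing memory or cache hardware.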
This definitely sounds like something faulty memory or cache would do.
We tried to get HP support engineers to help us out with the problem, but they hardly made a dent in our troubles. The hardware diagnostics on the server did not turn up any abnormalities, which we had already expected. Also, as no hardware was showing errors, they were very reluctant to do any swap-outs for us; we had to talk to our HP account manager to get things sorted out.
Since our first suspicion was the array controller, we had the motherboard swapped out. Powered up the server and waited… before long, it was BSODing again!
Next on the list was the RAM: we swapped it out, powered up and waited… and again, the BSODs continued.
Now we were at our wits' end, wondering if it could be the CPUs or the hard disks, but issues from CPUs and hard disks generally don't manifest the type of errors seen in the 0xA diagnosis.
Then one of my colleagues suggested that we change the cache module on the array controller. It is a separate daughterboard on the mainboard, so we had it swapped out.
Now that solved the problem! The server has not BSODed for over 5 days, when it would do so at least twice a day before the swap-out.
There we have it: a tiny cache module costing us 2-3 weeks of troubleshooting and countless visits to the datacenter.
In hindsight, the error exhibited in the 0xA diagnosis could not have come from main memory; after all, the DIMMs have error correction, and it would have been flagged if any of the RAM had an issue. Then again, I have seen weird issues with servers where undetected faulty RAM was the cause of the problem, but that was a few years ago…
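For context on why ECC RAM would have flagged this: SECDED memory can correct any single-bit error (and detect double-bit errors) by recomputing parity over overlapping groups of bits. A toy Hamming(7,4) sketch of the idea, purely illustrative (real DIMMs use wider codes over 64-bit words):

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4        # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4        # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4        # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(code):
    """Correct up to one flipped bit and return the 4 data bits."""
    p1, p2, d1, p3, d2, d3, d4 = code
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 = clean; else 1-based error position
    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1      # repair the single flipped bit
    return [fixed[2], fixed[4], fixed[5], fixed[6]]

code = hamming74_encode([1, 0, 1, 1])
code[4] ^= 1                          # flip one bit in storage
print(hamming74_correct(code))        # [1, 0, 1, 1] - the flip is repaired
```

The array controller's cache daughterboard evidently let a flipped bit through to the OS uncorrected, which is exactly the failure mode the ECC-protected DIMMs would have caught and logged.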
The issue was indeed related to the array controller, so most of our diagnosis was pointing at the right culprit, just the wrong module!