One thing that I always remind myself of is to be open to all possibilities, especially when troubleshooting technical issues. This usually works well, but sometimes, when we have already decided on the result or the reason for a problem, it can blind us to anything else.
The other day at the data center in Japan, one of the HP ProLiant servers had a problem and we kept getting the message “hardware error”. So my immediate assumption was that something must be wrong with one of the hardware parts of the server. So in comes the HP engineer to look at the problem. My initial suspects were the system board and the SCSI board. So I thought that this would be a quick and easy job: 1 hour, everything replaced, and we go home.
But NO!! The engineer replaced the system board; it did not work. The SCSI board; did not work. Then we waited another 3 hours for some CPU and memory parts to come in. Replaced the CPU; did not work. Replaced the memory; still did not work! Argghhhhh… it was already 8 pm! The good news is that the server was not in production, so we called it a day after about 6 hours of working on it. Went home and rested.
The next day, a new engineer came in and I expected him to bring in more parts. So I was a little pissed when he did not carry anything with him, and I said to myself: “Oh no, what can he do without parts? Not another late night!?”
So I had already decided that this was a stupid engineer and was a little bit on edge (only internally), just waiting to see what miracle he could perform. He asked for a Windows 2003 CD, which we did not have. So I showed him where I got the error and, using the exact same hard disks, inserted them into another machine of the same specs and showed him that it booted up fine. Then I kept repeating (in broken Japanese) that we had done this sooooooo many times with all the other servers and they worked, and only THAT machine did not work. I was trying to impress upon him that he needed to CHANGE the whole bloody machine and save my time.
Then he did something that caught me off guard, and suddenly I knew what was most likely the problem!
He took out the PCI network card and the 2 HBA (Fibre Channel) cards and placed them on the floor. Almost immediately, I knew that the problem must be due to one of the HBA cards, and the symptoms made sense all of a sudden. True enough, it was one of the HBA cards… and it had nothing to do with the server; it was our own stuff.
I could have kicked myself for my stupidity, because this is one of the worst sins one can commit in troubleshooting: “Not breaking down and identifying all the variables and eliminating them one by one!” My assumption that the server had to be faulty was so strong that I forgot that other variables, like the cards or even the cables, could be the source of the problem.
Boy, did I say a lot of “domo arigato gozaimasu” to him as he was leaving.