VMware: NFS datastores randomly going inactive and active

19 Mar

Quite a while ago, we had an issue with some newly deployed clusters. The NFS datastores were going inactive and then recovering at random on different ESXi 5.0 hosts. At random, we would find one or more ESXi hosts with some of their datastores in an inactive state; a few minutes later, they would recover. The vmkernel logs showed the ESXi hosts losing their connection to the datastores and restoring it later. In addition, some datastores could not be written to after being mounted, even though the permissions were definitely set correctly and we had compared them to the working ones. Lastly, no errors were logged on the NetApp filer, which hosts the NFS shares.

As it turns out, the issue was caused by misconfigured MTU and QoS settings on the network switches in our UCS chassis! All our newly deployed clusters had jumbo frames configured on the IP storage network, but jumbo frames were not set up correctly on the hardware side, i.e. the UCS side. Running vmkping with a packet size of 8000 against your storage IP will quickly uncover this issue:

vmkping -s 8000 <your storage IP>
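
A stricter variant of this test sets the don't-fragment bit and uses the largest payload that fits in one frame. The sketch below is only an illustration: it assumes IPv4 and a configured MTU of 9000, the helper names icmp_payload and vmkping_cmd are made up for this example, and 192.0.2.10 is a placeholder address. It prints the command rather than running it.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Print (do not run) the vmkping command for a given MTU and storage IP.
# -d sets the don't-fragment bit, so an oversized packet fails outright
# instead of being fragmented somewhere along the path.
# Hypothetical helper, arguments: <mtu> <storage IP>
vmkping_cmd() {
    echo "vmkping -d -s $(icmp_payload "$1") $2"
}

# Assumed values: MTU 9000, placeholder storage IP 192.0.2.10
vmkping_cmd 9000 192.0.2.10
```

If the printed command fails on the ESXi host while a plain vmkping to the same IP works, some hop between the vmkernel port and the filer is probably not passing jumbo frames.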

In hindsight, the symptom of NFS datastores going inactive and coming back should quickly tell you that the network is probably the problem.

