We had an interesting week. It started with a flood of calls from the desktop team: many of the workstations they had rebooted or powered up over the weekend would not come back up. These are virtual machines running in our vSphere environment. In vCenter the VMs showed up as “invalid”, and we found that they all had zero-byte VMX files. At first the failures appeared random, but we soon narrowed them down to 2 particular clusters; the issue occurred only there.
A bit of explanation on our setup: these are UCS blades built as ESXi 4.1 hosts. There are 16 hosts in each cluster, with DRS and HA enabled on all clusters (of course). The hosts use NFS datastores backed by NetApp filers. These clusters host our VDI environment, which is managed via XenDesktop, and we have a monthly reboot and patching cycle for all desktops in our firm.
After some further investigation, we found that the issue seemed to occur when DRS kicked in and vMotioned VMs from one host to another. Sometimes it worked, but sometimes the VM became invalid after the vMotion attempt. VMware support came back pointing the finger at an NFS storage problem: for some reason, in some cases, when a vMotion occurred the host lost access to the VM’s files on the NFS datastore and got permission denied.
The storage team was quite adamant that nothing had changed during the weekend. So after almost a week of digging, we found that the issue was caused by some of the ESXi hosts losing their static route to the NetApp filer. We verified that the problem only occurred when a vMotion happened between two hosts that had both lost the static route; there were no issues when it was between a faulty host and a working one.
Now, each NFS datastore points to a volume on the NetApp filer, and each volume grants read/write permission to a particular subnet. All hosts in a cluster have a pair of teamed NICs configured on that subnet (let’s call it the NFS subnet), which is granted read/write access to those volumes. A static route is established to ensure that NFS traffic is routed through this subnet. If calls to the NFS datastore go out via any other subnet on the host, the volumes become read-only, which explains why our problem occurred. To summarize:
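As an illustration, the per-subnet read/write grant on the filer side is typically a line in the filer's exports file. This is only a sketch of what such an entry could look like on Data ONTAP 7-Mode; the volume name and subnet below are hypothetical examples, not our actual values:

```
# /etc/exports on the NetApp filer (7-Mode syntax; names and subnets are made up)
# Grants read/write and root access only to clients on the NFS subnet
/vol/vdi_datastore01  -sec=sys,rw=192.168.50.0/24,root=192.168.50.0/24
```

Any host whose NFS traffic arrives from an address outside that subnet falls back to the default export permissions, which in a setup like this would be read-only or denied.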
- Some ESXi hosts lost their static route to the NetApp filer; as a result, the NFS datastores appeared read-only to them.
- When DRS kicked in to vMotion the powering-up VMs, a vMotion between two faulty hosts caused loss of write access to the VMX file, and as a result it became zero bytes.
We are still trying to find out why perfectly healthy hosts lose their static route.
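In the meantime, checking and restoring the route on an affected host is straightforward from the ESXi shell (Tech Support Mode on 4.1). The commands below are a sketch; the network and gateway addresses are hypothetical examples:

```sh
# List the host's current static routes; the NFS subnet route should appear here
esxcfg-route -l

# If the route to the NFS subnet is missing, re-add it
# (192.168.50.0/24 and 192.168.10.1 are made-up example values)
esxcfg-route -a 192.168.50.0/24 192.168.10.1
```

Once the route is back, the host reaches the filer from the NFS subnet again and the datastores return to read/write.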