Recently, we needed to use a feature in vROPS 6.2+ to dump alerts into a folder. One of the outbound settings in vROPS is the “Log File Plugin”. When you configure it, you tell vROPS to create a text file for each alert in an output folder on the appliance. The recommendation from VMware is that this output folder should be a mounted volume pointing to an external source so that you don’t eat up all the disk space on the appliance.
Tag Archives: vmware
Updated this article after Peter’s comments and my own testing
During failover of an RP4VM consistency group (CG), you have a few choices when selecting the test network to fail your VMs over to. Why is this necessary?
A really interesting read on how a team at VMware resolved an issue with a non-performing HP blade. The final takeaways are:
- Thermal paste really can impact performance!
- HP Active Health System logs should (but don’t) include when CPUs clock down to prevent overheating.
- CPU clock throttled error messages don’t appear in ESXi logs.
For the longest time, VMware administrators have been using the C# vSphere Client, or “fat client”. Over the years, it has continued to plague us with hidden confirmation dialog boxes, error dialog boxes that pop up continuously, version incompatibilities on the same computer and application crashes.
Worst of all is the need to test, package and supply new versions of the client whenever the infrastructure is upgraded. It’s not a big deal if the only people using it are your own VMware administrators. But in a big organisation with VDI in the mix, you need to hand off new versions to desktop support and even to end users themselves. On top of that, the older version needs a clean uninstall before the new version can be installed, or else you may end up with a knackered computer which won’t run either version.
Good news for some folks: vCenter Server Appliance 6.0 is now on par with its former enterprise siblings, namely vCenter on Windows + SQL or vCSA with Oracle.
For big enterprises, it is very common to install vCenter on a Windows machine and have your DBAs set up the databases on SQL. One of the main problems with this is the process of getting a new database, or expanding existing ones when you run out of space. It can take time to set up a new vCenter, and you often end up expanding the databases through an incident ticket. The other problem is that because the SQL database is managed by the DBA team, you are dependent on that team’s effectiveness to keep your vCenter running. During busy periods for the DBA team, you might find the vCenter service stopping or failing due to a stuck job or SQL running out of space, with your team not alerted in time, causing an incident.
Thus running your own vCSA, if you have a big enough cloud team, is probably a better solution for keeping the lights on for your vCenter servers, especially if they are consumed by VDI/Desktop teams and workflow applications like Orchestrator.
Of course, the challenge for Windows folks like me is having to master a new Linux OS and the vPostgres database for optimization and troubleshooting. But these are always good challenges to have.
I guess most of us have had the experience of putting an ESXi host into maintenance mode and finding that it got stuck on one last VM. When you look into the host, you also see shut-down VMs and templates still registered on it, and you have to manually move them out before trying to get that last VM fixed so that the host can enter maintenance mode.
Great news with vSphere 6.0 Update 2: the order of evacuation for a host going into maintenance mode is improved! Specifically:
Starting with vSphere 6.0 Update 2, when a host enters Maintenance Mode the host will evacuate all the powered off VMs and templates first, and then proceed to vMotion all the powered on VMs.
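The ordering described above can be illustrated with a small sketch. It only models the stated behaviour and is not VMware's implementation; the `VM` class and the `evacuation_order` function are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class VM:
    # Hypothetical record for illustration; not a vSphere API object.
    name: str
    powered_on: bool
    is_template: bool = False


def evacuation_order(vms):
    """Return VMs with powered-off VMs and templates first, powered-on VMs last."""
    # False sorts before True, and sorted() is stable, so VMs keep their
    # relative order within each of the two groups.
    return sorted(vms, key=lambda vm: vm.powered_on and not vm.is_template)


if __name__ == "__main__":
    inventory = [
        VM("web01", powered_on=True),
        VM("old-test", powered_on=False),
        VM("win-template", powered_on=False, is_template=True),
        VM("db01", powered_on=True),
    ]
    for vm in evacuation_order(inventory):
        print(vm.name)
```

With this sample inventory, the powered-off VM and the template are listed before the two running VMs, which are the ones that would need vMotion.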
For many who have moved on to vSphere 5.1 and beyond, and despite VMware’s focus on web-based management instead of the fat client, we still had to use the fat client to connect to and manage ESXi hosts. The great news is that there is now a web-based client to manage ESXi hosts directly. You do have to install a VIB on your host for this to happen (I am sure this will be integrated in later versions) and it only works on 5.5 U3 and above, which will be released later.
Quite a while ago, we had an issue with some newly deployed clusters. The NFS datastores were going inactive and recovering randomly on different ESXi 5.0 hosts. At random, we would find one or more ESXi hosts with some of their datastores in an inactive state. After a while, maybe a few minutes later, they would recover. The vmkernel logs showed the ESXi hosts losing connection to the datastores and then restoring the connection later. In addition, some datastores could not be written to after being mounted, even though the permissions were definitely set correctly and we had compared them to the working ones. Lastly, there were no errors logged on the NetApp filer which hosts the NFS shares.
As it turns out, the issue was caused by misconfigured MTU and QoS settings on the network switches of our UCS chassis! All our newly deployed clusters had jumbo frames configured on the IP storage network, but jumbo frames were not set up correctly on the hardware side, i.e. the UCS side. Running vmkping with a size of 8000 against your storage IP will quickly uncover this issue.
vmkping -s 8000 <your storage IP>
In hindsight, the symptoms of NFS datastores going inactive and coming back should quickly tell you that the network is probably the problem.
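To test the full jumbo frame size rather than an arbitrary 8000 bytes, the largest ICMP payload that fits in one frame can be worked out from the MTU. The sketch below assumes IPv4 with a 20-byte header (no IP options) plus the 8-byte ICMP header, and adds vmkping's `-d` option so the packet is not fragmented along the way. The function names and the example address are made up for illustration.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
ICMP_HEADER_BYTES = 8


def max_icmp_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError("MTU is too small to carry an ICMP payload")
    return mtu - overhead


def vmkping_command(storage_ip, mtu=9000):
    """Build a vmkping command line that tests the given MTU end to end."""
    # -d sets the don't-fragment bit, -s sets the ICMP payload size.
    return f"vmkping -d -s {max_icmp_payload(mtu)} {storage_ip}"


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder address, not a real storage IP.
    print(vmkping_command("192.0.2.10"))
```

For a 9000-byte MTU this gives a payload size of 8972. If that ping fails while a standard-size ping works, some device in the path is probably not passing jumbo frames.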
The most challenging part of having a blade enclosure system like the HP c7000 blade enclosure series is getting the firmware to match across the components. For example, you may start off with BL460c G7 blades and, one year down the road, decide to add Gen8 blades into the same enclosure. It is not a simple matter of just plugging them in; many times it will not work, as the underlying OA firmware may not support Gen8 blades. However, one cannot just go ahead and upgrade the OA firmware without first checking the firmware versions of the G7 blades and iLOs to ensure that they are supported with the new OA firmware. This is usually not too big a problem when the blades are new and the estate is small, but if you have them deployed globally and over a few years, you can be assured that firmware versions will be very varied, and any attempt to standardize even just the OA firmware can be difficult.
This is why HP has a compatibility matrix for its systems. It used to be a bit more complex (but easier to work with), as the table would state the minimum firmware version each component needed in order to work with the others. So if you wanted to upgrade the OA firmware by three versions but keep the rest the same, it would not be an issue. However, they have since streamlined this and force everyone to upgrade to a single version level. So if you want to upgrade the OA firmware by three versions, you need to upgrade all other components to the same version base.
Now, if you are running ESXi hosts on these blades, you also have to consider the recommended driver versions, which work in tandem with the OS version and the HP blade firmware.
We are currently using three Syslog collectors, one per region (i.e. AMER, EMEA & APAC), running version 5 U2 on Windows 2008 R2 64-bit machines. I had previously asked VMware about the maximum number of ESXi hosts that can report to a collector, and they categorically said that there is no limit for the service. Technically this is correct. Practically, however, the limit comes from the OS on which you run the service: its per-process limits determine how many hosts can and should report to a collector. Recently, we found logs not being updated, or the log folder for an ESXi host empty, on our Syslog collectors. A network monitor between the ESXi hosts and the Syslog server showed the hosts continuously sending data via UDP port 514 (this is our setup). So the problem had to be on the Syslog server end.
Upon checking the debug log, we found a lot of these errors: Exceptions.IOError: [Errno 24] Too many open files: u'E:\\syslog\\188.8.131.52\\syslog.log'
So obviously the OS or the process cannot work with too many open files at one time. The two troubled Syslog servers had 700 and almost 1,400 hosts pointing to them. Another, which was working fine, had about 400 hosts. So it looks like the limit is somewhere between 400 and 700 hosts. I restarted the service, waited for the error to appear in the debug log, and then counted the number of Syslog folders that had been updated.
Consistently, the number was around 600. Firing up Process Explorer (from MS) and looking at the handles for the syslogcollector.exe process, I could see that the handle count maxes out at around 700 on both of the troubled Syslog servers. On the good one, the handle count is about 600. So I would say that there is a limit to the number of hosts you can configure for the Syslog collector, but the limit is due to how the OS works. To be on the safe side, I would limit the number of hosts per Syslog collector to about 400 to 500. This should stress it just right without breaking it; however, please do check the debug logs to tune your figures.
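The folder count above can be scripted rather than done by hand. The following is a minimal sketch, assuming the layout seen in the error message (one folder per host under the syslog root, each holding a syslog.log). The function names and the default of 450 hosts per collector (the middle of the 400 to 500 range) are assumptions for illustration and are not part of the Syslog collector itself.

```python
import math
import time
from pathlib import Path


def count_recently_updated(root, max_age_seconds, log_name="syslog.log", now=None):
    """Count host folders whose log file was modified within max_age_seconds."""
    now = time.time() if now is None else now
    count = 0
    for host_dir in Path(root).iterdir():
        log_file = host_dir / log_name
        # Empty host folders (no log file) are skipped, not counted.
        if log_file.is_file() and now - log_file.stat().st_mtime <= max_age_seconds:
            count += 1
    return count


def collectors_needed(total_hosts, hosts_per_collector=450):
    """Estimate how many collectors keep each one under the per-collector limit."""
    if hosts_per_collector <= 0:
        raise ValueError("hosts_per_collector must be positive")
    return math.ceil(total_hosts / hosts_per_collector)


if __name__ == "__main__":
    # Hosts pointing at the three servers described in the post.
    for hosts in (400, 700, 1400):
        print(hosts, "hosts ->", collectors_needed(hosts), "collector(s)")
```

Running `count_recently_updated` against the syslog root shortly after a service restart shows how many hosts were still being written to when the error appeared, which is the figure to tune the per-collector limit against.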