We continue to be a very satisfied customer and partner of DataGravity. Just this week, we were saved in a big way by our DataGravity Discovery Series storage array. First off, I’ll provide a little context around the issue.
The Issue: Virtual Machines Offline
We upgraded our production environment to the latest build of VMware vSphere 6 on Sunday. Everything appeared to go well, but on Monday morning we received alerts that a couple of virtual machines on one of our hosts were offline. They were taxing the host with 100% CPU usage and were completely unresponsive. We wound up having to power them off, but we could not power them back on. One of the offline and unavailable virtual machines was our vCenter instance. VMware support was less than helpful, and that host was in a very bad spot and only getting worse.
Recovery
Attempted quick recovery options:
- Pull the failed-to-start VMs to a different host, clean up the registrations and bring them online. This worked for some, but not vCenter.
- Use the published harsh kill methods for the VPX processes. This wouldn’t work. Kill reported the PID was invalid.
- vMotion running VMs off of the affected host, then power cycle it. Trouble is, we cannot vMotion without vCenter.
Attempted longer recovery options:
- Copy the vCenter machine to a different storage platform. VMware support started this, but it was going far too slowly.
- Utilize Veeam to run an instant recovery on the vCenter VM, then rerun the v6 upgrade. The issue here is that Veeam quickly required vCenter to be online. Uh oh.
- Rebuild vCenter as a fresh VM. We’d be starting from scratch and lose far too much in terms of customization.
All of these longer options hit a wall pretty quickly and weren’t very practical. We had some critical infrastructure pieces offline, and we needed much more timely results. We couldn’t afford to take down another set of 15 virtual machines to reboot the host during business hours.
DataGravity to the Rescue!
There’s an excellent feature within the Discovery Series called Discovery Points (like snapshots). They are implemented at the individual virtual machine level. As the original vCenter VM was offline and locked from starting due to a host issue, I found the VM Clone option within the Discovery Point screen to be the winning solution.
As shown in the image above, clicking Clone VM instantly created a fresh folder within my existing datastore. I browsed the datastore and added the VMX file to inventory. I gave the VM a new name, and “BAM!” We were back in business! We still have a storage vMotion to run to get files back where they need to be, but this allowed us to get vCenter online, vMotion the remaining VMs off the failing host, and get that host rebooted and back online in a healthy state.
All in all, we talk plenty about the insight and analytics that the DataGravity Discovery Series can provide, but it delivers a ton of value from a system administration perspective as well.