Hi all,
For my first blog post here at Model, I figured I'd pass on an interesting issue I ran across a few weeks ago. I was investigating what was presented as an unhealthy four-node Hyper-V cluster. The symptom was that two of the nodes would randomly go into a paused state; any workloads running on those nodes during the event would be gracefully drained off to the other two healthy nodes.
Inside Failover Cluster Manager you could see that the nodes were indeed paused.
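The same state is easy to confirm from PowerShell with the FailoverClusters module; a quick sketch, with a placeholder cluster name:

```powershell
# Requires the FailoverClusters module (installed with RSAT or on any cluster node)
Import-Module FailoverClusters

# List every node and its current state; the affected nodes report State = Paused
Get-ClusterNode -Cluster "HVCLUSTER01" | Select-Object Name, State
```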
For testing, I began by resuming operations on the nodes without failing back any roles. I was hoping to catch the pause in action and then be able to crawl the cluster logs for the culprit. So I resumed the fourth node in the cluster and began taking a peek at the cluster logs.
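Both steps are one-liners if you prefer PowerShell to the GUI; a sketch, assuming Windows Server 2012 or later (the -Failback parameter isn't available on 2008 R2) and a placeholder node name:

```powershell
# Resume the paused node without failing any roles back onto it
Resume-ClusterNode -Name "HVNODE04" -Failback NoFailback

# Generate the cluster debug log for the last 24 hours (TimeSpan is in minutes);
# one Cluster.log per node is copied into the destination folder
Get-ClusterLog -TimeSpan 1440 -Destination "C:\Temp\ClusterLogs"
```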
I found no errors from the last few days and kept digging, when I noticed, about ten minutes after the resume, that the node was back in a paused state. I assumed that a fresh set of cluster logs would show a nice trail of events and give me a great start on tracking down the root cause.
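Because the pause kept recurring roughly ten minutes after each resume, a small polling loop is handy for timestamping the exact moment the state flips; a minimal sketch, again with a placeholder node name:

```powershell
# Poll the node state every 30 seconds and print a timestamp on each transition
$previous = $null
while ($true) {
    $state = (Get-ClusterNode -Name "HVNODE04").State
    if ($state -ne $previous) {
        "{0}  HVNODE04 is now {1}" -f (Get-Date -Format o), $state
        $previous = $state
    }
    Start-Sleep -Seconds 30
}
```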
I had high hopes for the cluster logs, but they came back clean, as did the Windows System, Application, Hyper-V, Failover Clustering, and Virtual Machine Manager logs.
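For what it's worth, that whole sweep collapses into a single Get-WinEvent call; channel names vary a bit by OS version, so treat this list as an example:

```powershell
# Pull Critical (1) and Error (2) events from the last three days across
# the channels relevant to a Hyper-V cluster node
Get-WinEvent -FilterHashtable @{
    LogName   = @(
        'System',
        'Application',
        'Microsoft-Windows-FailoverClustering/Operational',
        'Microsoft-Windows-Hyper-V-VMMS-Admin'
    )
    Level     = 1, 2
    StartTime = (Get-Date).AddDays(-3)
} -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, LogName, Id, Message
```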
After a few hours of searching I was still no closer to an answer, but the complete absence of errors gave me the idea that maybe this wasn't an error at all but expected behavior. So what could be deliberately pausing a cluster node? I then started thinking about Virtual Machine Manager and its Performance and Resource Optimization (PRO) functionality.
PRO was enabled along with Dynamic Optimization, so I disabled both options for the cluster and then resumed one of the paused nodes to test, pretty confident I'd found the issue.
Ten minutes later it was back to paused! This time, though, I noticed something interesting in the last job status in VMM.
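The job history is also queryable from the VMM PowerShell module, which is faster than scrolling the Jobs view; a sketch assuming the VMM 2012 cmdlets and a placeholder VMM server name:

```powershell
Import-Module virtualmachinemanager
Get-SCVMMServer -ComputerName "VMMSERVER01" | Out-Null

# List the most recent jobs; the giveaway was a VMM-initiated maintenance mode
# job against the host (maintenance mode pauses the node in the cluster)
Get-SCJob | Sort-Object StartTime -Descending |
    Select-Object -First 20 Name, Status, StartTime
```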
With that job logged, I was able to crawl back through the VMM job history and confirm that on both nodes this was not an error of any kind: it was VMM itself placing the nodes into a paused state.
After I disabled all PRO settings in VMM across all clusters, disabled Dynamic Optimization on all clusters, and finally removed and re-added the cluster in VMM, the nodes stopped pausing at random intervals. So the moral of this story is to be in control of your automation, not let your automation control you. Proper health checks to confirm that your automation is working correctly can be the difference between it being a time saver and it turning into a time vampire, as it did in this case.
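For reference, the remove-and-re-add portion of that fix looks roughly like this in the VMM module; the cluster, host group, and Run As account names are placeholders, and I turned PRO and Dynamic Optimization off through the console (host group and cluster properties) rather than scripting it:

```powershell
Import-Module virtualmachinemanager
Get-SCVMMServer -ComputerName "VMMSERVER01" | Out-Null

# Remove the cluster from VMM management (this does not touch the cluster itself)
Get-SCVMHostCluster -Name "HVCLUSTER01.contoso.com" | Remove-SCVMHostCluster

# Re-add it under the appropriate host group using a Run As account
$hostGroup = Get-SCVMHostGroup  -Name "Production"
$runAs     = Get-SCRunAsAccount -Name "HyperVHostAdmin"
Add-SCVMHostCluster -Name "HVCLUSTER01.contoso.com" `
    -VMHostGroup $hostGroup -Credential $runAs
```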