The Case of the Crashing Hyper-V Nodes
Hi again all! In this post I’ll be discussing a recent case we worked on with a customer. They were experiencing stability issues on a newly upgraded five-node Server 2016 cluster. The nodes in the cluster would blue screen at random, causing VMs to cold boot on other nodes.
This was also the first time I had gotten to see RDMA in use with Mellanox networking gear. It was a very impressive setup, and seeing Live Migrations over RDMA at full speed once the cluster was stable was a real treat!
For troubleshooting and discovery we had an outage window after hours, as doing almost anything in Failover Cluster Manager or Hyper-V Manager during working hours would cause the hosts to blue screen. During the outage window, while draining the first node, we witnessed Live Migrations behaving oddly. While watching network traffic egress from the host server, we could not see network counters increment or traffic increase on the Live Migration networks that were specified in the cluster configuration. Then, before the VMs completed the Live Migration, the node initiating the drain would BSOD. As we now had a consistent process for crashing the cluster, this is where we began our troubleshooting.
Unfortunately, the Hyper-V and cluster logs were not reporting any failures, so we focused on the Mellanox switching infrastructure, which we had found was logging RX Pause Packets. While researching why this counter was incrementing, we found the Mellanox configuration guide that had been followed for the 2016 installation. On first reading, the guide appeared to say that for RDMA to function correctly in Server 2016 the NDKWithGlobalPause registry key needed to be enabled by setting it to 1. After additional research we discovered that for this environment the setting should be disabled. This setting controls whether or not the server sends out flow control XON/XOFF packets on the Mellanox network for lossless communications. Once we had disabled it on two nodes for testing, we tried Live Migrations again.
After the configuration changes we were able to see Live Migration traffic egressing via the Live Migration networks and successfully transferring to the target host. We then applied these settings to all other hosts in the cluster and began draining nodes at random to test stability. As Live Migrations continued to be successful, we felt we had found the issue and then just enjoyed an hour or so of watching VMs move around at the speed of RDMA!
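If you want a quick way to confirm the change really landed on every host before you start draining nodes, something like the rough Python sketch below could help. It is only an illustration: it assumes you have already collected the NDKWithGlobalPause value from each node by whatever means you prefer, and it does not read the registry itself. The node names, the function name and the choice to treat a missing value as "needs review" are all made up for this example.

```python
from typing import Dict, List, Optional

# 0 = disabled, which is the value that stabilized the cluster in this case.
DESIRED_VALUE = 0


def nodes_needing_change(collected: Dict[str, Optional[int]],
                         desired: int = DESIRED_VALUE) -> List[str]:
    """Return the nodes whose NDKWithGlobalPause value is missing or not `desired`.

    `collected` maps a node name to the value read from that node, or None
    if the value could not be read (assumption: treat that as needing review).
    """
    return sorted(name for name, value in collected.items() if value != desired)


if __name__ == "__main__":
    # Hypothetical values gathered from a five-node cluster.
    collected = {"HV01": 0, "HV02": 0, "HV03": 1, "HV04": None, "HV05": 0}
    for node in nodes_needing_change(collected):
        print(f"{node}: NDKWithGlobalPause is not {DESIRED_VALUE}, review before draining")
```

An empty result means every node reported the same disabled setting, which is the state we were in when the Live Migrations finally behaved.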
So if you are an RDMA shop and looking to jump to Server 2016, just watch that regkey! Also, a great big thanks to Didier Van Hoye and his blog, WorkingHardinIT. He did a great job on his post below that helped us connect the dots on this one!