Hi again all!  In this post I’ll be discussing a recent case we worked on with a customer.  They were experiencing stability issues on a newly upgraded five-node Server 2016 cluster.  The nodes in the cluster would blue screen at random, causing VMs to cold boot on other nodes.


This was also the first time I had gotten to see RDMA in use with Mellanox networking gear.  It was a very impressive setup, and seeing Live Migrations run over RDMA at full speed once the cluster was stable was even more so!


For troubleshooting and discovery we scheduled an outage window after hours, since doing almost anything in Failover Cluster Manager or Hyper-V Manager during working hours would cause the hosts to blue screen.  During the outage window, while draining the first node, we saw Live Migrations behaving oddly.  Watching traffic egress from the host server, we could not see network counters increment or traffic increase on the Live Migration networks specified in the cluster configuration.  Then, before the VMs finished their Live Migration, the node initiating the drain would BSOD.  Since this gave us a consistent way to crash the cluster, this is where we began our troubleshooting.


Unfortunately, the Hyper-V and cluster logs were not reporting any failures, so we focused on the Mellanox switching infrastructure, which we had found was logging RX Pause Packets.  While researching why this counter was incrementing, we found the Mellanox configuration guide that had been followed for the 2016 installation.  On first reading, the guide appeared to say that for RDMA to function correctly in Server 2016, the NDKWithGlobalPause registry key needed to be turned on by setting it to 1.  After additional research we found that, for this environment, the setting should be disabled.  This key controls whether or not the server sends out flow control XON/XOFF packets on the Mellanox network for lossless communications.  Once we had disabled it on two nodes, we tested Live Migrations again.
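The counter check we kept coming back to during the drain is simple enough to sketch.  This is only an illustration in Python, not the tooling we used on site: the interface names (`LM1`, `LM2`, `Mgmt`) and byte counts are made up, and how you collect the two snapshots of sent-byte counters is left to you.

```python
from typing import Dict, List


def interfaces_with_traffic(
    before: Dict[str, int], after: Dict[str, int], min_bytes: int = 1
) -> List[str]:
    """Return interfaces whose sent-byte counter grew by at least min_bytes."""
    active = []
    for name, start in before.items():
        # An interface missing from the second snapshot counts as unchanged.
        end = after.get(name, start)
        if end - start >= min_bytes:
            active.append(name)
    return sorted(active)


# Hypothetical snapshots taken a few seconds apart during a node drain.
# These numbers are invented for illustration, not data from the customer.
before = {"LM1": 1_000, "LM2": 2_000, "Mgmt": 5_000}
after = {"LM1": 1_000, "LM2": 2_000, "Mgmt": 5_400}

live_migration_nics = {"LM1", "LM2"}  # assumed names for the Live Migration networks
active = interfaces_with_traffic(before, after)
idle_lm = sorted(live_migration_nics - set(active))

print("Interfaces with traffic:", active)
print("Idle Live Migration NICs:", idle_lm)
```

If the Live Migration NICs show up in the idle list while a drain is in progress, you are looking at the same symptom we saw.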


After the configuration changes we were able to see Live Migration traffic egressing via the Live Migration networks and transferring successfully to the target host.  We then applied these settings to all other hosts in the cluster and began draining nodes at random to test stability.  As Live Migrations continued to succeed, we felt we had found the issue, and then just enjoyed an hour or so of watching VMs move around at the speed of RDMA!
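When a setting is being rolled out node by node like this, it helps to confirm every host ended up with the same value before you start draining at random.  Here is a minimal Python sketch of that comparison.  The node names are hypothetical, the values would come from however you read the registry on each host (not shown here), and it assumes 0 means disabled, as the opposite of the 1 mentioned above.

```python
from typing import Dict, List, Optional


def audit_setting(
    values: Dict[str, Optional[int]], desired: int = 0
) -> Dict[str, List[str]]:
    """Split nodes into those with the wrong value and those with no reading."""
    mismatched = sorted(
        node for node, value in values.items()
        if value is not None and value != desired
    )
    # None means the value could not be read; check those hosts by hand.
    unknown = sorted(node for node, value in values.items() if value is None)
    return {"mismatched": mismatched, "unknown": unknown}


# Hypothetical NDKWithGlobalPause readings for a five-node cluster.
readings = {"node1": 0, "node2": 0, "node3": 1, "node4": 1, "node5": None}

report = audit_setting(readings, desired=0)
print("Still need the change:", report["mismatched"])
print("Could not be read:", report["unknown"])
```

The desired value of 0 reflects what was right for this environment; check your own before copying it.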


So if you are an RDMA shop looking to jump to Server 2016, just watch that regkey!  Also, a great big thanks to Didier Van Hoye and his blog, WorkingHardinIT.  He did a great job on his post below, which helped us connect the dots on this one!






About the Author: Nick Taylor

Consultant – Model Technology Solutions

Nick is an IT professional with more than 19 years of experience and a passion for learning about technology. His areas of expertise include datacenter, hypervisors, storage, network, cloud, and OSD. He spends his free time delving in crypto, video games, automation, IoT, and really anything nerdy.

