One of the services we most often perform for clients is a health check of their Operations Manager environment, which helps them get the most from their investment by identifying issues before they cause problems. After all, this is the infrastructure that alerts any time something else in the environment has a problem, so it is critically important that monitoring be both thorough and resilient enough to alert on problems regardless of their source. As such, one of the first things I check is the fault tolerance of the environment, especially the agent failover configuration.
I find that the failover configuration for SCOM agents is one of the most frequently overlooked settings in SCOM, despite how critical it is to maintaining fault tolerance. Should a management or gateway server go down, it is important to know that the agents will redirect communication to another server and keep talking. Agents can do this, but only if they have additional servers listed specifically as Failover Servers in their configuration.
Theoretically, when an agent is installed with a management server (as opposed to a gateway server) set as its Primary Management Server, all other management servers should automatically be added to its configuration as Failover Servers. Unfortunately, that does not always happen. Additionally, if an agent has a gateway server set as its Primary Management Server upon installation, the agent assumes that no other gateways are available, and no failover configuration is performed automatically. Given the irregularity of automatic agent failover configuration and the fact that the SCOM Operations Console does not expose any information about a given agent’s Failover Servers, it is easy to overlook situations where failover has not been configured.
As I’m addicted to finding ways for PowerShell to help improve my life, I wrote a script to collect the agent failover configuration for all agents. The script exports the data into a CSV which I can easily review, sort, filter, and manipulate in Excel to quickly assess the state of an environment’s agent failover configuration and recommend action. The full script is attached at the end of this post.
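For readers who want a feel for the approach before downloading anything, here is a minimal sketch of the idea. It is not the attached script: `Get-AgentFailoverRow` is a hypothetical helper name, the column names and output file name are my own choices for illustration, and it assumes PowerShell 3.0 or later with the OperationsManager module available.

```powershell
# Sketch only: Get-AgentFailoverRow is a hypothetical helper, not the attached script.
function Get-AgentFailoverRow {
    [CmdletBinding()]
    param(
        # Agent objects as returned by Get-SCOMAgent
        [Parameter(Mandatory, ValueFromPipeline)]
        $Agent
    )
    process {
        # Collect the names of any failover servers; Where-Object drops a null result
        # so an agent without failover servers ends up with an empty list.
        $failover = @(
            $Agent.GetFailoverManagementServers() |
                Where-Object { $_ } |
                ForEach-Object { $_.DisplayName }
        )

        # One report row per agent (column names are illustrative assumptions)
        [pscustomobject]@{
            Agent           = $Agent.DisplayName
            PrimaryServer   = $Agent.PrimaryManagementServerName
            FailoverServers = $failover -join '; '
            FailoverCount   = $failover.Count
        }
    }
}

# Usage on a management server (requires the OperationsManager module):
#   Import-Module OperationsManager
#   New-SCOMManagementGroupConnection -ComputerName 'localhost'
#   Get-SCOMAgent | Get-AgentFailoverRow |
#       Export-Csv -Path .\AgentFailoverReport.csv -NoTypeInformation
```

Sorting or filtering the resulting CSV on the FailoverCount column makes agents with no failover servers easy to spot.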
Once the agent failover configuration is known, it is easy to take action. Agent failover can be configured manually in PowerShell with the Set-SCOMParentManagementServer cmdlet, quickly increasing the SCOM environment’s fault tolerance. For an automated solution, Tao Yang’s excellent Operations Manager Self Maintenance Management Pack contains a workflow which will automatically configure agent failover for all agents reporting to management servers within a specified resource pool. If you have a larger environment and some MP XML editing skills (or a talented friend), it’s not terribly difficult to duplicate the rule in a custom management pack and enable multiple resource pools for automated agent failover configuration. (I have actually already done this; perhaps I’ll share the management pack eventually, with great respect and credit to Tao for the basis.)
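As a rough illustration of the manual route, the sketch below assigns every other non-gateway management server as a failover server for agents whose primary is a management server. `Get-FailoverCandidate` is a hypothetical helper of my own, and matching servers on their Name property against the agent's PrimaryManagementServerName is an assumption you should verify in your environment. Agents reporting to gateways are skipped here; they would need a list of gateway servers instead.

```powershell
# Sketch only: Get-FailoverCandidate is a hypothetical helper for illustration.
function Get-FailoverCandidate {
    param(
        # Candidate servers, e.g. the non-gateway output of Get-SCOMManagementServer
        [Parameter(Mandatory)] [object[]] $ManagementServer,
        # Name of the agent's current Primary Management Server
        [Parameter(Mandatory)] [string]   $PrimaryName
    )

    # If the primary is not in the candidate list (for example, it is a gateway),
    # return nothing rather than guess at a failover set.
    if ($ManagementServer.Name -notcontains $PrimaryName) { return }

    # Every candidate except the primary becomes a failover server
    $ManagementServer | Where-Object { $_.Name -ne $PrimaryName }
}

# Usage on a management server (requires the OperationsManager module).
# -WhatIf previews the change; remove it once the output looks right.
#   Import-Module OperationsManager
#   New-SCOMManagementGroupConnection -ComputerName 'localhost'
#   $ms = Get-SCOMManagementServer | Where-Object { -not $_.IsGateway }
#   foreach ($agent in Get-SCOMAgent) {
#       $failover = @(Get-FailoverCandidate -ManagementServer $ms `
#                         -PrimaryName $agent.PrimaryManagementServerName)
#       if ($failover.Count -gt 0) {
#           Set-SCOMParentManagementServer -Agent $agent -FailoverServer $failover -WhatIf
#       }
#   }
```

Keeping the selection logic in its own function makes it easy to swap in a narrower list, such as only the servers in one site, without touching the loop that applies the change.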
Download the script from the link below. If you run it from a management server, no inputs are required, though it accepts parameters for SCOM management server name, output report file name, and output location. If you have any questions, post them in the comments below. Until next time, happy scripting!
Click to download Generate_AgentFailoverReportCSV.zip