In my current Operations Manager engagement, we came across an issue with Operations Manager failing to discover new SQL databases on a previously-monitored instance. The root cause was unexpected (and in fact, unrelated to Operations Manager) and ripe for sharing, in case anyone else comes across this issue in the future. So here goes!
We first noticed the problem after several databases had been migrated from a SQL instance being decommissioned to an instance hosted on the client’s core SQL cluster. The cluster is a 6-node beast with thousands of gigabytes of RAM spread across the nodes, supporting numerous clustered instances. We were already monitoring the cluster nodes and SQL instances within Operations Manager, so the expectation was that the next time Operations Manager’s database discovery workflow ran, the migrated databases would be discovered as new databases on their instance and would appear within Operations Manager. However, 24 hours later, the databases still hadn’t been found. We started off with performing the usual troubleshooting – confirming Agent Proxy was still enabled on the node, triggering the discovery on demand, clearing the health service state cache to reload configuration, all with no success. We did find one clue to work from, however – in the Event Log on the node hosting the instance, we found several Warning events with the following text:
The process started at <time> failed to create System.Discovery.Data, no errors detected in the output. The process exited with 4294967295
Command executed: “C:\Windows\system32\cscript.exe” /nologo “DiscoverSQL2008DB.vbs” {<discovery rule guid>} {<target instance guid>} <servername> <SQL connection string> <SQLServer\SQLInstance> “Exclude:”
Working Directory: C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Monitoring Host Temporary Files 3\963\One or more workflows were affected by this.
Workflow name: Microsoft.SQLServer.2008.DBDiscovery
Instance name: <servername.domain.tld>
Instance ID: {<guid>}
Management group: <management group name>
Here we had evidence that something was preventing the database discovery from functioning, though it was not immediately clear what that something was. The next step, then, was to troubleshoot the database discovery.
We searched through the health service state cache on the cluster node and found the discovery script, then copied it elsewhere on the server (in order to prevent the Operations Manager agent’s health service state grooming procedures from deleting it while we were testing). Running the script manually in an administrative Command Prompt, using the same arguments listed in the event log, produced the same result – a blank output. No errors, no discovery data, just…nothing. The problem, therefore, couldn’t have been in the local copy of the script, but perhaps in the arguments supplied to it.
Next, we opened up the SQL 2008 Discovery management pack and took a look at the data source for the database discovery workflow. Here we came across our second clue – the data source module invoking the discovery script was configured to add the following arguments:
$MPElement$ $Target/Id$ $Config/ComputerID$ $Config/ComputerName$ $Config/SQLConnectionString$ $Config/SQLInstanceName$ “Exclude:$Config/ExcludeList$” $Target/Property[Type=”SQL!Microsoft.SQLServer.DBEngine”]/TcpPort$
The clue that jumped out at us was that the number of arguments supplied by the module numbered 8, while our event log message only contained 7; the SQL instance’s TCP port was not being passed into the discovery. Looking inside the discovery script, we found that “OPTION EXPLICIT” was stated, meaning that if not all required arguments were passed in, the script wouldn’t run. Sure enough, here was the cause of the discovery failure – the SQL instance’s TCP port was not being passed to the discovery, resulting in the discovery script producing blank output rather than the dataitems containing database information.
Now that we’d determined the issue with the database discovery, we were left with a new problem – why was the TCP port not being passed into the script arguments? The data source was configured to pull the TCP port value directly from the SQL instance’s properties, so the next place to look was at the discovered data for the SQL instance within Operations Manager. When we pulled up the instance’s Properties, here’s what we found:
Note the yellow highlighted area – the TCP port property was blank. A blank property explains why no data was being passed into the database discovery; there was no data to pass in. It seemed the real problem was a failure during the SQL instance’s discovery, though that issue was manifesting as a database discovery failure. We needed to test that in order to confirm the hypothesis, though, so my next step was to open up SQL Server Configuration Manager on the cluster node to look up the instance’s TCP port value. My intent was to repeat the test above, running the SQL database discovery script manually with the arguments supplied by the event log warning, adding in the TCP port value. If the lack of TCP port was truly the cause, then running the script like this should return a collection of dataitems. However, when I opened SQL Server Configuration Manager, I was met with this error:
There appeared to be an issue with the WMI provider for SQL. This seemed like the sort of issue that could prevent Operations Manager from discovering the TCP port information successfully. More so, it was immediately preventing me from finding the data and testing my hypothesis. Therefore, it was a problem, and I needed to solve it.
A quick search on the issue turned up this KB article on Microsoft’s support site: Error message when you open SQL Server Configuration Manager in SQL Server: “Cannot connect to WMI provider. You do not have permission or the server is unreachable”. That article specifically identifies the source of the problem as being caused by the uninstallation of a SQL instance from a server with both 32 and 64 bit versions of SQL installed. We were unable to confirm with the local SQL administrators whether or not that had actually occurred on that cluster node recently, however we were able to implement the fix listed.
Per the KB article, we recompiled the MOF files for the installed version of SQL with the following command:
mofcomp “%programfiles(x86)%\Microsoft SQL Server\100\Shared\sqlmgmproviderxpsp2up.mof”
(Note that the version number above reflects SQL 2008 R2; if a different version of SQL is installed, the folder path will be different.)
After recompiling the MOF files, we restarted the Windows Management Instrumentation service and tried opening SQL Server Configuration Manager again. This time, it was a success – the program opened, and we were able to look up the assigned TCP port for the offending SQL instance experiencing the database discovery failure. We repeated our test, manually running the SQL database discover script with the arguments supplied by the event log message, now adding in the TCP port, and successfully received a page of dataitems containing information about the installed databases.
Since we had found and resolve an issue with the WMI provider on the SQL server, the next step was to confirm whether or not that solved our root issue – the lack of TCP port data being discovered and present within Operations Manager, thus causing the initial database discovery failure that led us down this road. We flushed the health service state cache on the agent, forcing it to reload configuration and re-run discoveries. Sure enough, within five minutes we saw that the SQL instance discovery had completed and Operations Manager had successfully found the TCP port for the instance. Even better, within another five minutes, all of the new databases on the instance were discovered by Operations Manager and began to be monitored. Problem solved!
So, to summarize – if Operations Manager is failing to discover new databases on a monitored SQL instance, check the instance’s properties to confirm whether or not the TCP port is listed for the instance. If not, try opening SQL Server Configuration Manager on the affected instance and if you encounter the error listed above, follow the steps in the KB article to recompile SQL’s MOF files and fix the broken WMI provider.
Hopefully this info helps someone else out there, or at least provides a good illustration of how to troubleshoot missing data and/or failing workflows in Operations Manager. As always, if you have any questions, please post them in the comments below.