Extreme slowness in the SCSM console combined with a flood of 33610 events can sometimes be caused by communication issues between the SCSM management servers and the SQL servers; enabling SQL’s Named Pipes setting for the SQL instances can quickly alleviate the issue.
Recently I was called into a client to troubleshoot an issue with their SCSM environment. Overnight they suddenly started experiencing delays lasting 1-6 minutes whenever they’d try to create a new work item or update an existing work item. Needless to say, this rendered the program nearly unusable and had a massive impact on their service desk. They initially tried increasing the memory and CPU of their SQL server in an attempt to overpower the issue, but that didn’t help – the slowness vanished when they restarted the server, but returned within 24 hours. After further attempts to resolve the issue internally, they decided it was time for Model to give it a shot and called us in to investigate.
The troubleshooting steps and eventual resolution provide good insight into how to troubleshoot SCSM performance issues, as well as shine a light on an uncommon issue that is definitely worthy of more blog attention. Read on to learn more.
Troubleshooting and Investigation
When I arrived at the client, the problem was in full force – SCSM was taking around 5 minutes to save any changes to a work item during which time the whole program just hung. As I started my investigation, here’s what I knew:
- The issue manifested in extreme slowness in the SCSM console.
- The client had no knowledge of what change may have been made to cause the problem.
- Restarting the SQL instance mitigated the problem, but it would return within 24-48 hours.
- Adding resources to SQL didn’t help.
Knowing that SQL performance is the primary bottleneck for SCSM console performance, I knew that SQL was going to be the main target of my investigation. Before I looked into it, though, I wanted to confirm whether or not there was any additional information on the SCSM management servers. As such, I logged into the management servers and took a peek at the OperationsManager event log. I found that the logs were full of Warning events with the ID 33610, all citing “Subscription query is taking too long”, with new events being generated every few seconds. It appeared that every single workflow in SCSM was experiencing massive delays when waiting for query results from SQL. This confirmed that the problem was going to be somehow related to SQL server; my next step was to find out exactly where.
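To put a number on "every few seconds", it can help to count the 33610 events per minute. Below is a minimal Python sketch of that idea, not something I ran on site. It assumes the OperationsManager log has been exported to CSV with columns named TimeCreated (ISO 8601 timestamps) and Id; those column names and the sample rows are assumptions for illustration only.

```python
import csv
import io
from collections import Counter
from datetime import datetime

SLOW_QUERY_EVENT_ID = 33610  # "Subscription query is taking too long"


def events_per_minute(csv_text, event_id=SLOW_QUERY_EVENT_ID):
    """Count events with the given ID, grouped by minute.

    Assumes columns named TimeCreated (ISO 8601) and Id; both names
    are hypothetical and depend on how the log was exported.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["Id"]) != event_id:
            continue
        stamp = datetime.fromisoformat(row["TimeCreated"])
        # Truncate to the minute so events in the same minute share a key.
        counts[stamp.replace(second=0, microsecond=0)] += 1
    return dict(counts)


# Made-up sample rows, purely to show the expected input shape.
sample = """TimeCreated,Id
2024-01-01T09:00:03,33610
2024-01-01T09:00:09,33610
2024-01-01T09:00:15,1000
2024-01-01T09:01:02,33610
"""

for minute, count in sorted(events_per_minute(sample).items()):
    print(minute.isoformat(), count)
```

A steady count of several events per minute, sustained over hours, matches the flood described above; a handful per day would point elsewhere.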
I next remoted into the SQL server hosting SCSM’s operational database (default name: ServiceManager). The first thing I did was open up Task Manager to get a quick look at the current resource utilization. SQL was using all the available memory it could, which is unsurprising as its default configuration tells it to do so and oftentimes (at least in my experience) that setting doesn’t get modified. Aside from that, though, everything looked fine – there was barely any CPU utilization and the ethernet connection had more bandwidth available than not.
Next, I pulled up Performance Monitor and loaded up several SQL performance counters to check for memory pressure. Specifically, I checked the Buffer Cache Hit Ratio and Page Life Expectancy counters, both found under SQLServer:Buffer Manager. These are both used to determine whether or not the SQL server is experiencing signs of memory pressure, in other words, whether enough memory has been allocated or the server actually needs more resources. Buffer Cache Hit Ratio is a counter with a value ranging from 0 to 100 which represents the percentage of page requests served from the cache (in memory) rather than read from disk. The closer it is to 100, the better; if it sits well below that, it implies more memory is needed by the server. Additionally, Page Life Expectancy is a measure of how long pages live in the cache (in memory), in seconds. A common rule of thumb is that this number should stay at 300 or more. If it is less than 300 (seconds), then pages are being swapped out too quickly, implying that not enough memory is available for SQL’s workload and performance is impacted.
In the case of my client, both of these numbers were great – Buffer Cache Hit Ratio was holding steady at 100 and Page Life Expectancy was in the thousands. There was no memory pressure and no resource contention on the SQL server contributing to the extreme slowness in the SCSM console.
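The two-counter check above can be written down as a small Python sketch. The 300-second Page Life Expectancy threshold comes from the rule of thumb mentioned earlier; the 90 percent cutoff for Buffer Cache Hit Ratio is my own assumption for illustration, and the function name is hypothetical. The values are typed in by hand from Performance Monitor rather than read from the server.

```python
PLE_MIN_SECONDS = 300       # rule-of-thumb threshold quoted in this post
HIT_RATIO_MIN_PERCENT = 90  # assumed cutoff, chosen only for illustration


def memory_pressure_signs(buffer_cache_hit_ratio, page_life_expectancy):
    """Return a list of human-readable warnings; empty means no signs."""
    signs = []
    if buffer_cache_hit_ratio < HIT_RATIO_MIN_PERCENT:
        signs.append(
            f"Buffer Cache Hit Ratio {buffer_cache_hit_ratio} is below "
            f"{HIT_RATIO_MIN_PERCENT}"
        )
    if page_life_expectancy < PLE_MIN_SECONDS:
        signs.append(
            f"Page Life Expectancy {page_life_expectancy}s is below "
            f"{PLE_MIN_SECONDS}s"
        )
    return signs


# Readings similar to my client's: hit ratio at 100, PLE in the thousands.
print(memory_pressure_signs(100, 4200))  # → []
```

An empty list here means the same thing it meant on site: memory is not the problem, so keep looking.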
With no processor or memory throttling occurring, the next thing to investigate was the disks supporting the SQL server, as SQL disk I/O is a major bottleneck for SCSM performance. The fact that everything was working perfectly one day and awfully the next, though, combined with the fact that the problem disappeared when SQL restarted only to come back a day or two later, made me pretty confident that the problem wasn’t disk-related. Sure enough, after running SQLIO on the disks, we confirmed that they were still providing excellent IOPS. Additionally, there were no events in the operating system or in the other hardware monitoring tools showing any issues with the disks. That problem was ruled out.
This left us in an interesting position. We had an issue where SQL was responding very slowly to queries, but there was no evidence on the server or within SQL of what might be causing it. The obvious remaining culprit was the network, but ping times between the management servers and SQL server were all <1ms, and a pure bandwidth problem wouldn’t explain why the issue went away when the SQL service restarted and returned later.
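Ping only measures ICMP round trips, so a complementary check is to time how long it takes to open a TCP connection to the SQL port itself. The following is a rough Python sketch of that idea, not a step from my actual investigation. The function names are hypothetical, and it only measures connection setup, not query time.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the milliseconds taken to open (and close) a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; nothing is sent
    return (time.perf_counter() - start) * 1000.0


def sample_connects(host, port, attempts=5):
    """Time several connection attempts in a row and return all results."""
    return [tcp_connect_ms(host, port) for _ in range(attempts)]
```

Run from a management server, something like sample_connects("sql01.example.local", 1433) would do the job, where the host name is made up and 1433 is the default TCP port for a default SQL instance (a named instance may listen elsewhere). Connection times far above the ping times would suggest the trouble sits in the TCP path rather than in raw latency.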
Finding the Solution
During further furious investigation, I came across a blog post by Ian Blythe (link) with very similar symptoms wherein he claimed that enabling SQL server’s Named Pipes feature in the SQL server Network Configuration resolved his issue. This required some further investigation before trying it, though.
Search for “SQL Server Named Pipes” and you’ll find that it is a very contentious subject, with people arguing both for and against its usage depending upon circumstances. The gist is that Named Pipes provides an alternate method of communication with SQL instances, originally designed for applications built around NetBIOS and other LAN-based protocols. It is disabled by default in SQL Server these days (and has been since SQL Server 2005) in favor of the default TCP/IP communication. However, some applications still perform better with pipes enabled, and thus the feature can be turned on when necessary.
It must be said that, normally, SCSM does not require named pipes. None of the other many SCSM implementations I’ve stood up or worked with in the last five years have ever needed it. However, given the odd circumstances with this client’s environment, we decided to give it a shot. We hopped into SQL Server Configuration Manager, selected the SQL Server Network Configuration node, selected the Named Pipes option, and enabled it.
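Once Named Pipes is enabled, one way to confirm that each protocol is reachable is to force it from a test client by prefixing the server name with tcp: or np: in the connection string. The Python sketch below only builds those server values; it does not connect to anything. The helper name and the SQL01 and SCSM names are hypothetical, and the TCP port is assumed to be the default 1433.

```python
def server_value(host, protocol, instance=None, port=1433):
    """Build a protocol-specific Server value for a SQL connection string.

    "tcp" gives tcp:<host>,<port>; "np" gives the named pipe path,
    using the default pipe name for a default or named instance.
    """
    if protocol == "tcp":
        return f"tcp:{host},{port}"
    if protocol == "np":
        if instance is None:
            pipe = "sql\\query"
        else:
            pipe = f"MSSQL${instance}\\sql\\query"
        return f"np:\\\\{host}\\pipe\\{pipe}"
    raise ValueError(f"unsupported protocol: {protocol!r}")


# Hypothetical server and instance names, for illustration only.
print(server_value("SQL01", "tcp"))
print(server_value("SQL01", "np"))
print(server_value("SQL01", "np", instance="SCSM"))
```

Feeding each value to the same test query from a management server makes it easy to compare how the two protocols behave side by side.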
Within 15 minutes the stream of Event 33610s on the SCSM management servers had halted and the extreme slowness in the SCSM console had vanished, with performance returning to normal. While it remained unclear what the client had changed to cause the problem, it was evident that something must have been interfering with the TCP/IP communication between the SCSM management servers and the SQL server, and by enabling Named Pipes, we’d successfully implemented a workaround.
It should not be a standard configuration change for SCSM environments, but in cases of mysterious extreme slowness in the SCSM console combined with a stream of 33610 events and no evidence of resource contention on the SQL server, enabling SQL server’s Named Pipes feature in the network configuration may help alleviate the performance issues. Don’t rush to put this change into effect, but add it to your bag of tricks in case the situation ever occurs for you.
The goal of this post was to walk through the process of troubleshooting SCSM performance issues and give attention to a rare issue and fix within SCSM. Hopefully this helps you as it has helped me!